Abstract

In this paper we study the problem of computing max-entropy distributions over a discrete set of
objects subject to observed marginals. Interest in such distributions arises due to their applicability in
areas such as statistical physics, economics, biology, information theory, machine learning, combinatorics
and, more recently, approximation algorithms. A key difficulty in computing max-entropy distributions
has been to show that they have polynomially-sized descriptions. We show that such descriptions exist
under general conditions. Subsequently, we show how algorithms for (approximately) counting the
underlying discrete set can be translated into efficient algorithms to (approximately) compute max-entropy
distributions. In the reverse direction, we show how access to algorithms that compute max-entropy
distributions can be used to count, which establishes an equivalence between counting and computing
max-entropy distributions.
1 Introduction

In this paper we study the computability of max-entropy probability distributions over a discrete set.
Consider a collection $\mathcal{M}$ of discrete objects whose building blocks are the elements
$[m] = \{1, 2, \ldots, m\}$; thus, $\mathcal{M} \subseteq \{0,1\}^m$. Suppose there is some unknown
distribution $p$ on $\mathcal{M}$ and we are given access to it via observables $\theta$, the simplest of
which is the probability that an element is present in a random sample $M$ from $p$; namely,
$\mathbb{P}_{M \leftarrow p}[e \in M] = \theta_e$. If $\theta$ is all we know, what is our best guess for $p$?
The principle of max-entropy [17, 18] postulates that the best guess is the distribution which maximizes
(Shannon) entropy.¹ Roughly, the argument is that any distribution which has more information must violate
some observable, and a distribution with less information must implicitly use additional independent
observables, hence contradicting their maximality. Access to such a distribution could then be used to
obtain samples which conform with the observed statistics and to obtain the most informed guess for
further statistics.

Given the fundamental nature of such a distribution, it should not be surprising that it shows up in various
areas such as statistical physics, economics, biology, information theory, machine learning, combinatorics
and, more recently, in the design of approximation algorithms, see for instance [26]. From a computational
point of view, the question is how to find max-entropy distributions. Note that the entropy function is
concave, hence, the problem of maximizing it over the set of all probability distributions over $\mathcal{M}$
with marginals $\theta$ is a convex programming problem. But what is the input? If $\theta$ and $\mathcal{M}$
are given explicitly, then a solution to this convex program can be obtained, using the ellipsoid method, in
time polynomial in $|\mathcal{M}|$ and the number of bits needed to represent $\theta$.² However, in most
interesting applications, while $\theta$ is given explicitly, $\mathcal{M}$ may be an exponentially-sized set
over the universe $\{0,1\}^m$ and specified implicitly. For example, the input could be a graph $G = (V,E)$
with $m$ edges and $\theta \in \mathbb{R}^m$, whereas $\mathcal{M}$ could be all spanning trees or all perfect
matchings in $G$; in such a scenario $|\mathcal{M}|$ could be exponential in $m$. This renders the convex
program for computing the max-entropy distribution prohibitively large. Moreover, simply describing the
distribution could require exponential space. The good news is that one can use convex programming duality
to convert the max-entropy program into one that has $m$ variables. Additionally, under mild conditions on
$\theta$, strong duality holds and, hence, the max-entropy distribution is a product distribution, i.e., there
exist $\gamma_e$ for $e \in [m]$ such that for all $M \in \mathcal{M}$, $p_M \propto \prod_{e \in M} \gamma_e$;
see [?] or Lemma 2.3. Thus, the max-entropy distribution for $\theta$ can be described by $m$ numbers
$\gamma = (\gamma_e)_{e \in [m]}$. There are two main computational problems concerning max-entropy
distributions. The first is to, given $\theta$ and implicit access to $\mathcal{M}$, obtain $\gamma$ such that
the entropy of the product distribution $p$, corresponding to $\gamma$, is close to that of the max-entropy
distribution, and the observables obtained from $p$ are close to $\theta$. The second is, given $\gamma$, to
obtain a random sample from the distribution $p$. The second problem can be handled by invoking the
equivalence between approximate counting and sampling due to Jerrum, Valiant and Vazirani [22] and, hence,
we focus on the first issue of computing approximations to the max-entropy distributions.³ However, the
existence of a $\gamma$ which requires polynomially-many bits in the input size is, a priori, far from clear.
This raises the crucial question of whether good enough succinct descriptions exist for max-entropy
distributions.
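To make the product form concrete, the following is a minimal brute-force sketch, assuming the family $\mathcal{M}$ is small enough to list explicitly (which is exactly the regime the paper is trying to get away from). The function names and the toy family, the three spanning trees of a triangle, are illustrative choices and do not come from the paper.

```python
import math
from itertools import combinations


def product_distribution(family, gamma):
    """Return {M: p_M} with p_M proportional to the product of gamma[e] over e in M."""
    weights = {M: math.prod(gamma[e] for e in M) for M in family}
    total = sum(weights.values())
    return {M: w / total for M, w in weights.items()}


def marginals(dist, m):
    """Return theta with theta[e] = P[e in M] for M drawn from dist."""
    return [sum(p for M, p in dist.items() if e in M) for e in range(m)]


def entropy(dist):
    """Shannon entropy in nats, using the convention 0 * ln(1/0) = 0."""
    return sum(p * math.log(1.0 / p) for p in dist.values() if p > 0)


# Hypothetical toy family: the three spanning trees of a triangle with edges 0, 1, 2.
family = [frozenset(pair) for pair in combinations(range(3), 2)]

uniform = product_distribution(family, [1.0, 1.0, 1.0])
skewed = product_distribution(family, [2.0, 1.0, 1.0])

print(marginals(uniform, 3), entropy(uniform))
print(marginals(skewed, 3), entropy(skewed))
```

Here the $m = 3$ numbers in `gamma` fully describe a distribution over the family; the first computational problem discussed above asks for the inverse map, from target marginals $\theta$ back to a suitable $\gamma$.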

While there is a vast amount of literature concerning the computation of max-entropy distributions (see
for example the survey [37]), previous (partial) results on computing max-entropy distributions required
exploiting some special structure of the particular problem at hand. In theoretical computer science, interest
in rigorously computing max-entropy distributions derives from their applications to randomized rounding
and the design of non-trivial approximation algorithms, notably to problems such as the symmetric and the
asymmetric traveling salesman problem (TSP/ATSP). For example, using a very technical argument, [1]



¹ Recall that the Shannon entropy of a distribution $p = (p_M)_{M \in \mathcal{M}}$ is $H(p) = \sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M}$.
² The ellipsoid algorithm requires a bounding ball which in this case is trivial since $\forall M \in \mathcal{M}$, $0 \le p_M \le 1$.
³ To be precise, this equivalence between random sampling and approximate counting holds when the combinatorial problem at
hand is self-reducible, see also [32].



give an algorithm to compute the max-entropy distribution over spanning trees of a graph. This algorithm
was then used by them to improve the approximation ratio for ATSP, and by [?] to improve the approximation
ratio for (graphical) TSP, making progress on two long-standing problems. Subsequently, the ability
to compute max-entropy distributions over spanning trees has also been used to design efficient
privacy-preserving mechanisms for spanning tree auctions by [16]. In another example, [2] show how to compute
max-entropy distributions over perfect matchings in a tree and use it to design approximation algorithms
for a max-min fair allocation problem. The question of computing max-entropy distributions over perfect
matchings in bipartite (and general) graphs, however, has been an important open problem. Recent
applications of the ability to compute max-entropy distributions over perfect matchings in bipartite graphs
include new approaches for TSP and ATSP, see [36] and Section 4.

For counting problems, it is rare to obtain algorithms that can count exactly, notable exceptions being
the problem of counting spanning trees in a graph [28] or certain counting problems restricted to trees that
can be solved using dynamic programming. Most natural counting problems turn out to be #P-hard, including the
problem of counting the number of perfect matchings in a bipartite graph [34, 35]. The goal then shifts to
finding algorithms that approximately count up to any fixed precision [27, 33]. Here the most successful
technique has been the Markov chain Monte Carlo (MCMC) method [20] which, when combined with the equivalence
between approximate counting and sampling [22],⁴ leads to many approximate counting algorithms. The technique
has been applied to many problems including counting perfect matchings in a bipartite graph [21], counting
bases in a balanced matroid [11], counting solutions to a knapsack problem [6] and counting the number of
colorings in restricted graph families [19]. However, the problem of obtaining approximate counting oracles
for several problems remains open as well; perhaps a prominent example is that of (approximately) counting
the number of perfect matchings in a general graph.

In combinatorics, max-entropy distributions are often referred to as hard-core or Gibbs distributions and
have been intensely studied. While it is nice that the hard-core distribution has a product form, i.e.,
$p_M \propto \prod_{e \in M} \gamma_e$, the question of interest here is whether one can upper bound the
$\gamma_e$s. Structurally, such a bound implies that hard-core distributions exhibit a significant amount of
approximate stochastic independence. In an important result, [25] proved such a bound for the hard-core
distribution over matchings in a graph. This led to resolving several questions involving asymptotic graph and
hypergraph problems. For instance, results of [23, 25, 24] prove that the fractional chromatic index of a
graph asymptotically behaves as its chromatic index. However, the argument of [25] is quite difficult and
seems to be specific to the setting of matchings, leaving it an interesting problem to understand under what
conditions one can obtain upper bounds on the $\gamma_e$s.

1.1 Our Contribution 

We first show that good enough succinct representations exist for max-entropy distributions. Subsequently,
we give an algorithm that computes arbitrarily good approximations to max-entropy distributions given $\theta$
and access to a suitable counting oracle for $\mathcal{M}$. Our algorithm is efficient whenever the
corresponding counting oracle is efficient. Moreover, the counting oracle can be approximate and/or
randomized. This allows us to leverage a variety of algorithms developed for several #P-hard problems to give
algorithms to compute max-entropy distributions. Consequently, we obtain several new and old results about
concrete algorithms to compute max-entropy distributions. Interesting examples for which we can use
pre-existing counting oracles to obtain max-entropy distributions include spanning trees, matchings in general
graphs, perfect matchings in bipartite graphs (using the algorithm from [21]) and subtrees of a rooted tree.
The consequence for perfect matchings in bipartite graphs makes the algorithmic strategies for TSP/ATSP
mentioned earlier computationally feasible, see Section 4. In the reverse direction we show that if one can
solve, even approximately,



⁴ We note that MCMC methods efficiently sample (and count) given a fixed $\gamma$, usually for $\gamma = 1$ corresponding to the
uniform distribution. The goal in our problem is to find a $\gamma$ that maximizes the entropy. In fact, given a $\gamma$, problem
specific MCMC methods can be used to generate a random sample from $\mathcal{M}$ according to the product distribution
corresponding to $\gamma$.



the convex optimization problem of computing the max-entropy distribution, one can obtain such counting
oracles. This establishes an equivalence between counting and computing max-entropy distributions in a
general setting. As a corollary, we obtain that the problem of computing max-entropy distributions over
perfect matchings in general graphs is equivalent to the, hitherto unrelated, problem of approximately
counting perfect matchings in a general graph.

1.1.1 Informal Statement of Our Results 

Before we describe our results a bit more technically, we introduce some basic notation. For
$\mathcal{M} \subseteq \{0,1\}^m$, let $P(\mathcal{M})$ denote the convex hull of $\mathcal{M}$, where each
$M \in \mathcal{M}$ is thought of as a 0/1 vector of length $m$, denoted $1_M$. Thus, given a $\theta$, for the
max-entropy program to have any solution, $\theta \in P(\mathcal{M})$. Since we are concerned with the case
when $\mathcal{M}$ is given implicitly and may be of exponential size, we no longer hope to solve the
max-entropy convex program directly since that may require exponentially many variables, one for each
$M \in \mathcal{M}$. Thus, we work with the dual to the max-entropy convex program. The dual has $m$ variables
and, if $\theta$ is in the relative interior of $P(\mathcal{M})$, the optimal dual solution can be used to
describe the optimal solution to the max-entropy convex program, which is a product distribution. In fact, we
assume one can put a ball of radius $\eta$ around $\theta$ and it still remains in the interior of
$P(\mathcal{M})$. Importantly, our algorithm requires access to a generalized counting oracle for
$\mathcal{M}$, which given $\gamma$ can compute $\sum_{M \in \mathcal{M},\, M \ni e} \prod_{e' \in M} \gamma_{e'}$
for all $e \in [m]$ and also the sum $\sum_{M \in \mathcal{M}} \prod_{e \in M} \gamma_e$. We also consider the
case when the oracle is approximate (possibly randomized) and, for a given $\varepsilon$, can output the sums
above up to a multiplicative error of $1 \pm \varepsilon$. The following is the first main result of the
paper, stated informally here.

Theorem 1.1 (Counting Implies Optimization, see Theorems 2.6 and 2.8) There is an algorithm which,
given access to a generalized (approximate) counting oracle for $\mathcal{M} \subseteq \{0,1\}^m$, a $\theta$
which is promised to be in the $\eta$-interior of $P(\mathcal{M})$ and an $\varepsilon > 0$, outputs a $\gamma$
such that its corresponding product probability distribution $p$ is such that
$H(p) \ge (1 - \varepsilon/\eta) H(p^*)$ and for every $e \in [m]$,

\[ \left| \mathbb{P}_{M \leftarrow p}[e \in M] - \theta_e \right| \le \varepsilon. \]

Here, $p^*$ is the max-entropy distribution corresponding to $\theta$. The number of calls the algorithm makes
to the oracle is bounded by a polynomial in the input size, $\ln 1/\eta$ and $\ln 1/\varepsilon$.

A useful setting for $\eta$ and $\varepsilon$ to keep in mind is $1/m^2$ and $1/m^3$ respectively. The
bit-lengths of the inputs to the counting oracle are polynomial in $1/\eta$ and, hence, the running time of our
algorithm depends polynomially on $1/\eta$. If the generalized counting oracle is $\varepsilon$-approximate,
the same guarantee holds. Note that for many approximate counting oracles, the dependence of their running
time on $\varepsilon$ is a polynomial in $1/\varepsilon$. Hence, in this case the running time depends
polynomially on $1/\varepsilon$. Finally, note that this result can be easily generalized to obtain algorithms
for the problem of finding the distribution that minimizes the Kullback-Leibler divergence from a given
product distribution subject to the marginal constraints, see Remark 2.12.

At a very high level, the algorithm in this theorem is obtained by applying the framework of the ellipsoid
algorithm to the dual of the max-entropy convex program. While it is more convenient to work with the
dual since it has $m$ variables, two issues arise: the domain of optimization becomes unconstrained, and the
separation oracle requires the ability to compute (possibly exponential) sums over subsets of $\mathcal{M}$.
While the counting oracles can be adapted to compute exponential sums, the unboundedness of the domain of
optimization is an important problem.

One of the technical results in the proof of the theorem above is structural and shows that this dual
optimization problem has an optimal solution in a box of size $m/\eta$ when $\theta$ is in the $\eta$-interior
of $P(\mathcal{M})$, see Theorem 2.7. Since the $\gamma_e$s are exponential in the respective dual variables,
there is an approximation $\gamma$ to the optimal solution to the max-entropy program, when $\theta$ is in the
$\eta$-interior of $P(\mathcal{M})$, such that the number of bits needed to represent each $\gamma_e$ is at
most $m/\eta$. Such a result has been obtained for the special case of spanning trees by [1] and for matchings
in a general graph by [25].

Given that counting algorithms for many problems are still elusive, one may ask if they are really
necessary to compute max-entropy distributions. The final result of this paper answers this question in the
affirmative and establishes a converse to Theorem 1.1.

Theorem 1.2 (Optimizing Implies Counting, see Theorem 2.11) There is an algorithm which, given oracle
access to an algorithm to compute an $\varepsilon$-approximation to the max-entropy convex program for an
$\eta$-interior point of $P(\mathcal{M})$, and a separation oracle for $P(\mathcal{M})$, can compute a number
$Z$ such that

\[ (1 - \varepsilon)\,|\mathcal{M}| \le Z \le (1 + \varepsilon)\,|\mathcal{M}|. \]

The number of calls made to the max-entropy oracle is polynomial in the input size and $1/\varepsilon$.

This result can be extended to obtain generalized counting oracles, see Remark 2.12. For all polytopes
of interest in this paper, separation oracles are known, see Section 3. Moreover, this result continues to
hold even when the separation oracle is approximate, or weak. As a corollary, using a separation oracle for
the perfect matching polytope for general graphs [?], we obtain that an algorithm to compute a good-enough
approximation to the max-entropy distribution for any $\theta$ in the perfect matching polytope of a graph
$G$ implies an FPRAS to count the number of perfect matchings in the same graph.

1.2 Technical Overview 

The starting point for our results is the following dual to the max-entropy convex program:

\[ \inf_{\lambda} \; \langle \lambda, \theta \rangle + \ln \sum_{M \in \mathcal{M}} e^{-\langle \lambda, 1_M \rangle}, \qquad (1) \]

where $1_M$ is the indicator vector for $M$. When $\theta$ lies in the relative interior of $P(\mathcal{M})$,
then strong duality holds between the primal and the dual.⁵ Hence, it follows from the first order conditions
on the optimal solution pair $(p^*, \lambda^*)$ that $p^*_M \propto e^{-\langle \lambda^*, 1_M \rangle}$ for
each $M \in \mathcal{M}$. Suppose we know that

1. $\lambda^*$ is bounded, i.e., $\|\lambda^*\| \le R$ for some $R$, and

2. there is a generalized counting oracle that allows us to compute the gradient of the objective function
$f(\lambda) = \langle \lambda, \theta \rangle + \ln \sum_{M \in \mathcal{M}} e^{-\langle \lambda, 1_M \rangle}$
at a specified $\lambda$. The gradient at $\lambda$, denoted $\nabla f(\lambda)$, turns out to be
a vector whose coordinate corresponding to $e \in [m]$ is

\[ \theta_e - \frac{\sum_{M \in \mathcal{M},\, M \ni e} e^{-\langle \lambda, 1_M \rangle}}{\sum_{M \in \mathcal{M}} e^{-\langle \lambda, 1_M \rangle}}. \]

Then, using the machinery of the ellipsoid method, it follows relatively straightforwardly that, for any
$\varepsilon$, we can compute a point $\lambda^\circ$ such that $f(\lambda^\circ) \le f(\lambda^*) + \varepsilon$
with at most $\mathrm{poly}(m, \ln 1/\varepsilon, \ln R)$ calls to the counting oracle. Note that since the
numbers fed into the counting oracle are of the form $e^{-\lambda_e}$, for each $e \in [m]$, the running time
of the counting oracle depends polynomially on $R$ rather than $\ln R$. Thus, we need $R$ to be polynomially
bounded. Hence, the question is:

Can we bound $\|\lambda^*\|$?



⁵ When $\theta$ lies on the boundary of $P(\mathcal{M})$, the infimum in the dual is not attained for any finite $\lambda$.



Indeed, a significant part of the work done in [1] was to bound this quantity for spanning trees and in [25]
for (not necessarily perfect) matchings in graphs. A priori it is not clear why there should be any such
bound. In fact, we observe that if $P(\mathcal{M})$ lies in some low-dimensional affine space in
$\mathbb{R}^m$, the optimal solution is not unique and can be shifted in any direction normal to the space,
see Lemma 2.3. Thus, one can only hope for the optimal solution to be bounded once one imposes the restriction
that $\lambda^*$ lies in the linear space corresponding to the affine space in which $P(\mathcal{M})$ lives.
One thing that works in our favor is that we have an absolute upper bound (independent of $\theta$) on the
optimal value of $f(\cdot)$, namely $m$. Roughly, this is because at optimality this quantity is an entropy
over a discrete set of size at most $2^m$. This implies that for all $M \in \mathcal{M}$,

\[ \langle \lambda^*, \theta \rangle - \langle \lambda^*, 1_M \rangle \le m. \qquad (2) \]
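The step from the bound on $f$ to inequality (2) is short enough to write out. The derivation below is reconstructed from the definition of $f$ given above rather than quoted from the paper; it uses only that $f(\lambda^*) = H(p^*) \le \ln |\mathcal{M}| \le m \ln 2 \le m$ by strong duality, and that a sum of positive terms is at least any single term.

```latex
\begin{align*}
m \;\ge\; f(\lambda^*)
  &= \langle \lambda^*, \theta \rangle
     + \ln \sum_{M' \in \mathcal{M}} e^{-\langle \lambda^*, 1_{M'} \rangle} \\
  &\ge \langle \lambda^*, \theta \rangle
     + \ln e^{-\langle \lambda^*, 1_{M} \rangle}
   \;=\; \langle \lambda^*, \theta \rangle - \langle \lambda^*, 1_{M} \rangle
   \qquad \text{for every } M \in \mathcal{M},
\end{align*}
```

which is exactly inequality (2).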

Using this, the $\eta$-interiority of $\theta$, and the fact that the diameter of $P(\mathcal{M})$ is at most
$\sqrt{m}$, it can then be shown that

\[ \max_{M \in \mathcal{M}} \langle \lambda^*, 1_M \rangle - \min_{M \in \mathcal{M}} \langle \lambda^*, 1_M \rangle \le \frac{\cdots}{\eta}. \]

Let us show how this immediately implies a bound on $R$ when $\mathcal{M}$ corresponds to all the spanning
trees of a graph with no bridge. Suppose $T$ and $T'$ are two trees such that $T'$ is obtained from $T$ by
deleting an edge $e$ and adding an edge $f$, then,

\[ \left| \langle \lambda^*, 1_T \rangle - \langle \lambda^*, 1_{T'} \rangle \right| = \left| \lambda^*_e - \lambda^*_f \right| \le \cdots \]

Thus, unless the graph has a bridge, this implies $|\lambda^*_e - \lambda^*_f| \le \cdots$ for all
$e, f \in G$. However, attempting a similar combinatorial argument for perfect matchings in a bipartite graph,
where we do not have this exchange property, the bound is worse by a factor of $2^m$.

Thus, we abandon combinatorial approaches and appeal to the geometric implication of (2) to obtain
the desired polynomial bounding box for all $P(\mathcal{M})$. The argument is surprisingly simple and we
sketch it here. One way to interpret (2) is that the vector $\lambda' = -\lambda^*/m$ has inner product at
most 1 with $v - \theta$ for all $v \in P(\mathcal{M})$. For now, neglecting the fact that $P(\mathcal{M})$
might live in a lower dimensional affine space and that 0 may not be in $P(\mathcal{M})$, this implies that
$\lambda'$ is in the polar of $P(\mathcal{M})$. However, since $\theta$ is in the $\eta$-interior of
$P(\mathcal{M})$, $P(\mathcal{M})$ contains an $\ell_2$ ball of radius at least $\eta$ inside it. Thus, the
polar of $P(\mathcal{M})$ must be contained in the polar of this ball, which is nothing but an $\ell_2$ ball
of radius $1/\eta$. This gives a bound of $1/\eta$ on the $\ell_2$ norm of $\lambda'$ and, hence, a bound of
$m/\eta$ on the norm of $\lambda^*$ as desired.

Thus, the ellipsoid method can be used to obtain a solution $\lambda^\circ$ such that
$f(\lambda^\circ) \le f(\lambda^*) + \varepsilon$. Why should this approximate bound imply that the product
distribution obtained using $\lambda^\circ$ is close in the marginals to $\theta$? The observation here is
that $f(\lambda^\circ) - f(\lambda^*)$ is the Kullback-Leibler (KL) divergence between the two distributions.
This implies a bound of $\sqrt{\varepsilon}$ on the marginals using a standard upper bound on the total
variation distance in terms of the KL-divergence.
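The identity between the dual gap and the KL-divergence can be checked directly. The derivation below is a reconstruction under the definitions above, not the paper's own proof: $p^\circ$ denotes the product distribution with $p^\circ_M = e^{-\langle \lambda^\circ, 1_M \rangle} / Z(\lambda^\circ)$, where $Z(\lambda) = \sum_{M \in \mathcal{M}} e^{-\langle \lambda, 1_M \rangle}$ is a shorthand introduced here. It uses $\sum_M p^*_M 1_M = \theta$ and $H(p^*) = f(\lambda^*)$ (strong duality), and then Pinsker's inequality as the "standard upper bound".

```latex
\begin{align*}
\mathrm{KL}(p^* \,\|\, p^\circ)
  &= \sum_{M \in \mathcal{M}} p^*_M \ln p^*_M
     \;-\; \sum_{M \in \mathcal{M}} p^*_M \ln p^\circ_M \\
  &= -H(p^*)
     + \sum_{M \in \mathcal{M}} p^*_M \langle \lambda^\circ, 1_M \rangle
     + \ln Z(\lambda^\circ) \\
  &= -f(\lambda^*) + \langle \lambda^\circ, \theta \rangle + \ln Z(\lambda^\circ)
   \;=\; f(\lambda^\circ) - f(\lambda^*) \;\le\; \varepsilon, \\[4pt]
\bigl| \theta_e - \mathbb{P}_{M \leftarrow p^\circ}[e \in M] \bigr|
  &\le \| p^* - p^\circ \|_{TV}
   \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p^* \,\|\, p^\circ)}
   \;\le\; \sqrt{\varepsilon}
   \qquad \text{for every } e \in [m].
\end{align*}
```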

In the case when we have access only to an approximate counting oracle for $\mathcal{M}$, things are more
complicated. Roughly, the approximate counting oracle translates to having access to an approximate gradient
oracle for the function $f(\cdot)$ and one has to ensure that $\lambda^*$ is not cut off during an iteration.
Technically, we show that this does not happen and, hence, approximate counting oracles are equally useful
for obtaining good approximations to max-entropy distributions.

Finally, note that the (projected-)gradient descent approach (see [29]) can also be shown to converge in
polynomial time and, possibly, can result in practical algorithms for computing max-entropy distributions.
In the case when the counting oracle is approximate, one has to deal with a noisy gradient and the solution
turns out to be similar to the one in the ellipsoid method-based algorithm in the presence of an approximate
counting oracle. In addition to a bound on $\|\lambda^*\|$, one needs to bound the $2 \to 2$ norm of the
gradient of $f$. While we omit the details of the gradient descent-based algorithm, we show that
$\|\nabla f\|_{2 \to 2}$ is polynomially bounded, see Remark 2.9 and Theorem C.1 in Appendix C. This bound may
be of independent interest.
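As an illustration of the gradient-based route, here is a minimal sketch that minimizes the dual objective $f$ by plain fixed-step gradient descent. It is not the paper's algorithm: the ellipsoid method is swapped for unprojected gradient descent, and the generalized counting oracle is a brute-force loop over an explicitly listed family, which is only viable for toy instances. The step size, iteration count, family and target marginals are all assumptions chosen for the example.

```python
import math
from itertools import combinations


def oracle(family, lam):
    """Brute-force stand-in for a generalized counting oracle.

    Returns Z = sum_M exp(-lam(M)) and Ze[e] = sum over M containing e of exp(-lam(M)).
    """
    Z, Ze = 0.0, [0.0] * len(lam)
    for M in family:
        w = math.exp(-sum(lam[e] for e in M))
        Z += w
        for e in M:
            Ze[e] += w
    return Z, Ze


def dual_value(family, lam, theta):
    """f(lam) = <lam, theta> + ln sum_M exp(-lam(M))."""
    Z, _ = oracle(family, lam)
    return sum(l * t for l, t in zip(lam, theta)) + math.log(Z)


def dual_gradient(family, lam, theta):
    """Coordinate e of the gradient is theta_e - Ze / Z."""
    Z, Ze = oracle(family, lam)
    return [t - ze / Z for t, ze in zip(theta, Ze)]


def gradient_descent(family, theta, step=0.5, iters=2000):
    """Fixed-step gradient descent from lam = 0 (step and iters are ad hoc choices)."""
    lam = [0.0] * len(theta)
    for _ in range(iters):
        g = dual_gradient(family, lam, theta)
        lam = [l - step * gi for l, gi in zip(lam, g)]
    return lam


# Hypothetical instance: spanning trees of a triangle, target marginals in the interior.
family = [frozenset(pair) for pair in combinations(range(3), 2)]
theta = [0.8, 0.6, 0.6]

lam = gradient_descent(family, theta)
Z, Ze = oracle(family, lam)
fitted = [ze / Z for ze in Ze]
print(fitted, dual_value(family, lam, theta))
```

On this instance the fitted marginals agree with `theta` to high precision, and the recovered weights $e^{-\lambda_e}$ are proportional to $(2, 1, 1)$, matching the product form $p_M \propto \prod_{e \in M} \gamma_e$.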

We now give an overview of the reverse direction: how to count approximately given the ability to solve
the max-entropy convex program for any point $\theta$ in the $\eta$-interior of $P(\mathcal{M})$. We start by
noting that if we consider $\theta^* = \frac{1}{|\mathcal{M}|} \sum_{M \in \mathcal{M}} 1_M$, then the optimal
value of the convex program is $\ln |\mathcal{M}|$. Thus, given access to this vertex-centroid of
$P(\mathcal{M})$ one can get an estimate of $|\mathcal{M}|$. However, computing $\theta^*$ can be shown to be
as hard as computing $|\mathcal{M}|$, for instance, when $\mathcal{M}$ consists of perfect matchings in a
bipartite graph, see [10]. We bypass this obstacle and apply the ellipsoid algorithm on the following
(convex-programming) problem

\[ \sup_{\theta} \, \inf_{\lambda} \, f_\theta(\lambda), \]

where $f_\theta(\lambda)$ is the function in (1) and where we have chosen to highlight the dependence on
$\theta$. The ellipsoid algorithm proposes a $\theta$ and expects the max-entropy oracle to output an
approximate value for $\inf_\lambda f_\theta(\lambda)$. This raises a few issues: First, given our result on
optimization via counting, it is unfair to assume that we have such an oracle that works for all $\theta$,
irrespective of the interiority of $\theta$ in $P(\mathcal{M})$. Thus, we allow queries to the oracle only
when $\theta$ is sufficiently in the interior of $P(\mathcal{M})$. Note that our algorithm for computing
the max-entropy distribution in our first theorem works under these guarantees. This requires, in addition,
a separation oracle for checking whether a point is in the $\eta$-interior of $P(\mathcal{M})$. We construct
such an $\eta$-separation oracle from a separation oracle for $P(\mathcal{M})$. The latter, given a point,
either says it is in $P(\mathcal{M})$ or returns an inequality valid for $P(\mathcal{M})$ but violated by
this point.
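The claim that the vertex-centroid $\theta^*$ yields the value $\ln |\mathcal{M}|$ can be checked on a toy instance. The sketch below is illustrative only: it lists the perfect matchings of $K_{3,3}$ explicitly (the edge indexing is an arbitrary choice made here) and verifies that at $\theta = \theta^*$ the gradient of $f_\theta$ vanishes at $\lambda = 0$, so by convexity $\lambda = 0$ is a minimizer and the optimal value is $f_{\theta^*}(0) = \ln |\mathcal{M}|$.

```python
import math
from itertools import permutations

n = 3
m = n * n
# Edge (i, j) of K_{3,3} gets index n * i + j; each permutation gives a perfect matching.
family = [frozenset(n * i + sigma[i] for i in range(n)) for sigma in permutations(range(n))]

# Vertex-centroid: the average of the indicator vectors of all matchings.
theta_star = [sum(1 for M in family if e in M) / len(family) for e in range(m)]


def dual_value(lam, theta):
    """f_theta(lam) = <lam, theta> + ln sum_M exp(-lam(M))."""
    Z = sum(math.exp(-sum(lam[e] for e in M)) for M in family)
    return sum(l * t for l, t in zip(lam, theta)) + math.log(Z)


def dual_gradient(lam, theta):
    """Coordinate e is theta_e minus the marginal of e under p_M proportional to exp(-lam(M))."""
    weights = [math.exp(-sum(lam[e] for e in M)) for M in family]
    Z = sum(weights)
    return [theta[e] - sum(w for M, w in zip(family, weights) if e in M) / Z
            for e in range(m)]


zero = [0.0] * m
grad_at_zero = dual_gradient(zero, theta_star)
value_at_zero = dual_value(zero, theta_star)
print(value_at_zero, math.log(len(family)))
```

Of course, building `theta_star` here already required enumerating the family, which is precisely the obstacle the text describes for implicitly given $\mathcal{M}$.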

The second issue is that $\theta^*$, our target point, may not be in the $\eta$-interior of $P(\mathcal{M})$.
In fact, there may not be any point in the $\eta$-interior of $P(\mathcal{M})$ when $\eta$ is
$1/\mathrm{poly}(m)$. However, under reasonable conditions on $P(\mathcal{M})$, which are satisfied for all
polytopes we are interested in, we can show that there is a point $\theta'$ in the $\eta$-interior of
$P(\mathcal{M})$. This allows us to recover a good enough estimate of $|\mathcal{M}|$. Thus (the way we apply
the framework of the ellipsoid algorithm), we are able to recover a point close enough to $\theta'$ by doing
a binary search on the target value of $|\mathcal{M}|$. As in the forward direction, because we assume that
the max-entropy algorithm is approximate, we must argue that $\theta'$ is not cut off during any iteration of
the ellipsoid algorithm.

We conclude this overview with a couple of remarks. First, unlike our results in the forward direction,
we cannot replace the ellipsoid method-based algorithm by a gradient descent approach. The reason is that
we only have a separation oracle to detect whether a point is in $P(\mathcal{M})$ or not. Second, we can
extend our result to show that, using a max-entropy oracle, one can obtain generalized approximate counting
oracles, see Remark 2.12.

1.3 Organization of the Rest of the Paper 

The rest of the paper is organized as follows. In Section 2 we formally define the objects of interest in our
paper including the convex program for optimizing the max-entropy distribution and its dual. We also define
counting oracles that are needed for solving the convex program. We then formally state our results and give
a few lemmas stating properties about the optimal and near-optimal solution to the dual of the max-entropy
convex program. In Section 3 we provide examples of some combinatorial polytopes to which our results
apply. In Section 4 we show how certain algorithmic approaches for approximating the symmetric and the
asymmetric traveling salesman problem are feasible as a result of one of the main results of this paper. In
Section 5 we prove that there is an optimal solution to the dual of the max-entropy convex program that is
contained in a ball of small radius around the origin. In Section 6 we use this bound on the optimal solution
to show that counting oracles, both exact and approximate, can be used to optimize the convex program via
the ellipsoid algorithm. In Section 7 we show the other direction of the reduction and give an algorithm that
can approximately count given an oracle that can approximately solve the max-entropy convex program.
Standard proofs are omitted from the main body and appear in Appendix A. In Appendix B we show how
generalized counting oracles can be obtained via max-entropy oracles. Here, we also introduce the program
for minimizing the KL-divergence with respect to a fixed distribution. Finally, in Appendix C we give a
bound on $\|\nabla f\|_{2 \to 2}$.

2 Preliminaries 

2.1 Notation 

In this section we introduce the general notation used throughout the paper. Vectors are denoted by plain
letters such as $a, b, c, d, x, y, u$ and $v$ and are over $\mathbb{R}^m$. We also use the Greek letters
$\lambda, \theta, \nu$ and $\gamma$ to denote vectors. 0 is sometimes used to denote the all-zero vector and
the usage should be clear from context. For reasons emanating from applications, we choose to index the set
$[m]$ by $e$. Hence, the components of a vector are denoted by $x_e$, $\lambda_e$, $\theta_e$, etc. We also
use notation such as $x_0, x_1, \ldots, x_t$ and $\lambda_0, \lambda_1, \ldots, \lambda_t$ to denote vectors.
It should be clear from the context that these are vectors and not their components. The Greek letters
$\eta, \alpha, \beta, \varepsilon, \zeta$ are used to denote positive real numbers. For a set
$M \in \{0,1\}^m$, let $1_M$ denote the 0/1 indicator vector for $M$. We use $1_M(e)$ to denote its $e$-th
component. Thus, $1_M(e) = 1$ if $e \in M$ and 0 otherwise. The letters $p$, $q$ and $r$ are reserved to
denote probability distributions over $\{0,1\}^m$. Of special interest are product probability distributions
where, for $M \in \{0,1\}^m$, the probability of $M$ is proportional to $\prod_{e \in M} \gamma_e$ for some
vector $\gamma$. We denote such a probability distribution by $p^\gamma$ to emphasize its dependence on
$\gamma$, and let $p^\gamma_M$ denote the probability of $M$. Additionally, $\langle x, y \rangle$ denotes the
inner product of two vectors, $\|\cdot\|$ denotes the Euclidean norm and
$\|x\|_\infty \stackrel{\mathrm{def}}{=} \max_{e \in [m]} |x_e|$. We also use the notation $\lambda(M)$ to
denote $\langle \lambda, 1_M \rangle$ for a vector $\lambda$ and $M \in \{0,1\}^m$. $|S|$ denotes the
cardinality of a set $S$.

2.2 Combinatorial Polytopes, Separation Oracles, Counting Oracles and Interiority 

The polytopes of interest arise as convex hulls of subsets of $\{0,1\}^m$ for some $m$. For a set
$\mathcal{M} \subseteq \{0,1\}^m$, the corresponding polytope is denoted by $P(\mathcal{M})$. Thus,

\[ P(\mathcal{M}) \stackrel{\mathrm{def}}{=} \Big\{ \sum_{M \in \mathcal{M}} p_M 1_M \; : \; p_M \ge 0, \; \sum_{M \in \mathcal{M}} p_M = 1 \Big\}. \]

Another way to describe $P(\mathcal{M})$ is to give a maximal set of linearly independent equalities satisfied
by all its vertices, and to list the inequalities that define $P(\mathcal{M})$. Thus, $P(\mathcal{M})$ can be
described by $(A^=, b)$ and $(A^\le, c)$ such that

\[ \forall M \in \mathcal{M}: \quad A^= 1_M = b \quad \text{and} \quad A^\le 1_M \le c. \]

While the former set cannot contain more than $m$ equalities, the latter set can be exponential in $m$ and we
do not assume that $(A^\le, c)$ is given to us explicitly.

Separation Oracles. On occasion we require access to a separation oracle for $P(\mathcal{M})$ of the following
form: Given $\lambda \in \mathbb{R}^m$ satisfying $A^= \lambda = b$, the separation oracle either says that
$A^\le \lambda \le c$ or outputs an inequality $(a', c')$ such that

\[ \langle a', \lambda \rangle > c'. \]

In fact, such an oracle is often termed a strong separation oracle.⁶

Counting Oracles. The standard counting problem associated to $\mathcal{M}$ is to determine $|\mathcal{M}|$,
i.e., the number of vertices of $P(\mathcal{M})$. We are interested in a more general counting problem
associated to $\mathcal{M}$ where there is



⁶ In our results that depend on access to a strong separation oracle, we can relax the guarantee to that of a weak separation oracle.
We omit the details.



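a weight $\lambda_e$ for each $e \in [m]$ and the weight of $M$ under this measure is $e^{-\lambda(M)}$. A
generalized exact counting oracle for $\mathcal{M}$ then outputs the following two quantities:

1. $Z^\lambda \stackrel{\mathrm{def}}{=} \sum_{M \in \mathcal{M}} e^{-\lambda(M)}$ and

2. for every $e \in [m]$, $Z^\lambda_e \stackrel{\mathrm{def}}{=} \sum_{M \in \mathcal{M},\, M \ni e} e^{-\lambda(M)}$.

The oracle is assumed to be efficient, i.e., it runs in time polynomial in $m$ and the number of bits needed
to represent $e^{-\lambda_e}$ for any $e \in [m]$.⁷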

While efficient generalized exact counting oracles are known for some settings, for many problems of
interest the exact counting problem is #P-hard. However, often, for these #P-hard problems, efficient oracles
which can compute arbitrarily good approximations to the quantities of interest are known. Thus, we have
to relax the notion to generalized approximate counting oracles which are possibly randomized. Such an
oracle, given $\varepsilon, \alpha > 0$ and weights $\lambda \in \mathbb{R}^m$, returns $\tilde{Z}^\lambda$ and
$\tilde{Z}^\lambda_e$ for each $e \in [m]$. The following guarantees hold with probability at least
$1 - \alpha$,

1. $(1 - \varepsilon) Z^\lambda \le \tilde{Z}^\lambda \le (1 + \varepsilon) Z^\lambda$ and

2. for every $e \in [m]$, $(1 - \varepsilon) Z^\lambda_e \le \tilde{Z}^\lambda_e \le (1 + \varepsilon) Z^\lambda_e$.

The running time is polynomial in $m$, $1/\varepsilon$, $\log 1/\alpha$ and the number of bits needed to
represent $e^{-\lambda_e}$ for any $e \in [m]$. For the sake of readability, we ignore the fact that
approximate counting oracles may be randomized. The statements of the theorems that use randomized approximate
counting oracles can be modified appropriately to include the dependence on $\alpha$. Note that if the problem
at hand is self-reducible, then having access to an oracle that outputs an approximation to just $Z^\lambda$
suffices. We omit the details and the reader is referred to a discussion on self-reducibility and counting in
[32]. Finally, it can be shown that, in our setting, the existence of a generalized (exact or approximate)
counting oracle is a stronger requirement than the existence of a separation oracle.
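For spanning trees, one of the settings where exact counting is possible, a generalized exact counting oracle can be sketched with the weighted Matrix-Tree theorem: with edge weights $w_e = e^{-\lambda_e}$, $Z^\lambda$ is a cofactor of the weighted Laplacian, and $Z^\lambda_e$ is $Z^\lambda$ minus the same sum over trees avoiding $e$. The code below is an illustrative floating-point sketch, not the paper's construction; the function names are made up here, bit-precision issues are ignored, and the subtraction can lose accuracy when $Z^\lambda_e$ is tiny relative to $Z^\lambda$.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting (floating point)."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-15:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return d


def tree_sum(n, edges, weights):
    """Sum over spanning trees T of the product of weights[e] for e in T."""
    lap = [[0.0] * n for _ in range(n)]
    for (u, v), w in zip(edges, weights):
        lap[u][u] += w
        lap[v][v] += w
        lap[u][v] -= w
        lap[v][u] -= w
    # Matrix-Tree theorem: delete row and column 0, take the determinant.
    return det([row[1:] for row in lap[1:]])


def spanning_tree_oracle(n, edges, lam):
    """Return (Z, [Z_e]) for the spanning trees of the graph on vertices 0..n-1."""
    weights = [math.exp(-l) for l in lam]
    Z = tree_sum(n, edges, weights)
    Ze = []
    for i in range(len(edges)):
        without = weights[:i] + [0.0] + weights[i + 1:]
        Ze.append(Z - tree_sum(n, edges, without))  # trees that use edge i
    return Z, Ze


triangle = [(0, 1), (0, 2), (1, 2)]
Z_unit, Ze_unit = spanning_tree_oracle(3, triangle, [0.0, 0.0, 0.0])
Z_skew, Ze_skew = spanning_tree_oracle(3, triangle, [-math.log(2.0), 0.0, 0.0])
print(Z_unit, Ze_unit)
print(Z_skew, Ze_skew)
```

With $\lambda = 0$ the oracle reduces to ordinary counting: the triangle has 3 spanning trees and each edge lies in 2 of them.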

Interior of the Polytope. The dimension of $P(\mathcal{M})$ is $m - \mathrm{rank}(A^=)$; the polytope
restricted to this affine space is full dimensional. Since we work with polytopes that are not full
dimensional, we extend the notion of the interior of the polytope $P(\mathcal{M})$ and use the following
definition.

Definition 2.1 For an $\eta > 0$, a point $\theta$ is said to be in the $\eta$-interior of $P(\mathcal{M})$ if

\[ \{ \theta' : A^= \theta' = b, \; \|\theta - \theta'\| \le \eta \} \subseteq P(\mathcal{M}). \]

We say that $\theta$ is in the interior of $P(\mathcal{M})$ if $\theta$ is in the $\eta$-interior of
$P(\mathcal{M})$ for some $\eta > 0$.

We are interested in the case where $\eta \ge \frac{1}{\mathrm{poly}(m)}$. Hence, it is natural to ask if for
every $P(\mathcal{M})$, there is a point in its $\frac{1}{\mathrm{poly}(m)}$-interior. The following lemma,
whose proof appears in Section 7, asserts that the answer is yes if the entries of $A^\le$ and $c$ are
reasonable (as is the case in all our applications).

Lemma 2.2 Let $\mathcal{M} \subseteq \{0,1\}^m$ and
$P(\mathcal{M}) = \{x \in \mathbb{R}^m : A^= x = b, \; A^\le x \le c\}$ be such that all the entries in
$A^\le, c \in \frac{1}{\mathrm{poly}(m)} \cdot \mathbb{Z}$ and their absolute values are at most
$\mathrm{poly}(m)$. Then there exists a $\theta \in P(\mathcal{M})$ such that $\theta$ is in the
$\frac{1}{\mathrm{poly}(m)}$-interior of $P(\mathcal{M})$.

At this point, if one wishes, one can look at Section 3 for some examples of combinatorial polytopes we 
consider in this paper.



7 To deal with issues of irrationality, it suffices to obtain the first $k$ bits of $Z^\lambda$ and $Z^\lambda_e$ in time polynomial in $k$ and $m$.
2.3 The Maximum Entropy Convex Program

In this section we present the convex program for computing max-entropy distributions. Let $P(\mathcal{M})$ be the 
polytope corresponding to $\mathcal{M}$. While we do not care about whether we have an oracle for $\mathcal{M}$ in this section, 
the notion of interiority is important.

Any point $\theta \in P(\mathcal{M})$ can, by definition, be written as a convex combination of vertices of $P(\mathcal{M})$, 
each of which is the indicator vector of some $M \in \mathcal{M}$. Each such convex combination is a probability distribution 
over $M \in \mathcal{M}$. Of central interest in this paper is a way to find the convex combination that maximizes 
the entropy of the underlying probability distribution. Given $\theta$, we can express the problem of finding the 
max-entropy distribution over the vertices of $P(\mathcal{M})$ as the program in Figure 1. Here $0 \ln \frac{1}{0}$ is assumed to 
be 0. This entropy function is easily seen to be concave and, hence, maximizing it is a convex programming 
problem.

$$
\begin{array}{lll}
\sup & \sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M} & \\
\text{s.t.} & & \\
& \forall e \in [m] & \sum_{M \in \mathcal{M},\, M \ni e} p_M = \theta_e \\
& & \sum_{M \in \mathcal{M}} p_M = 1 \\
& \forall M \in \mathcal{M} & p_M \ge 0
\end{array}
$$

Figure 1: Max-Entropy Program for $(\mathcal{M}, \theta)$

The following folklore lemma, whose proof appears in Appendix A.1, shows that if $\theta$ is in the 
interior of $P(\mathcal{M})$, then the max-entropy distribution corresponding to it is unique and can be succinctly 
represented. Recall the notation that for $\lambda : [m] \to \mathbb{R}$ and $M \in \mathcal{M}$, $\lambda(M) \stackrel{\text{def}}{=} \langle \lambda, 1_M \rangle = \sum_{e \in M} \lambda_e$.

Lemma 2.3 For a point $\theta$ in the interior of $P(\mathcal{M})$, there exists a unique distribution $p^*$ which attains the 
max-entropy while satisfying

$$\sum_{M \in \mathcal{M}} p^*_M 1_M = \theta.$$

Moreover, there exists a $\lambda^* : [m] \to \mathbb{R}$ such that $p^*_M \propto e^{-\lambda^*(M)}$ for each $M \in \mathcal{M}$.

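As we observe soon, while $p^*$ is unique, $\lambda^*$ may not be. First, we record the following definitions about 
such product distributions.

Definition 2.4 For any $\lambda \in \mathbb{R}^m$, we define the distribution $p^\lambda$ on $\mathcal{M}$ such that

$$\forall M \in \mathcal{M} \quad p^\lambda_M \stackrel{\text{def}}{=} \frac{e^{-\lambda(M)}}{Z^\lambda} \quad \text{where} \quad Z^\lambda \stackrel{\text{def}}{=} \sum_{N \in \mathcal{M}} e^{-\lambda(N)}.$$

The marginals of such a distribution are denoted by $\theta^\lambda$ and defined to be

$$\theta^\lambda_e \stackrel{\text{def}}{=} \sum_{M \in \mathcal{M},\, M \ni e} p^\lambda_M = \frac{Z^\lambda_e}{Z^\lambda} \quad \text{where} \quad Z^\lambda_e \stackrel{\text{def}}{=} \sum_{M \in \mathcal{M},\, M \ni e} e^{-\lambda(M)}.$$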
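
As a concrete illustration of Definition 2.4, the following minimal sketch computes $Z^\lambda$, $Z^\lambda_e$ and the marginals $\theta^\lambda$ by brute-force enumeration. The toy family (spanning trees of a triangle) and the names `FAMILY`, `partition_functions` and `marginals` are assumptions made for illustration; enumeration is only feasible for tiny families and is not the counting oracle discussed in the paper.

```python
import math
from itertools import combinations

# Toy family (assumed for illustration): spanning trees of a triangle.
# The ground set {0, 1, 2} indexes the three edges; any two edges form a tree.
FAMILY = [frozenset(pair) for pair in combinations(range(3), 2)]


def partition_functions(family, lam):
    """Return (Z, [Z_e for each e]) for weights lam by direct enumeration."""
    weight = {M: math.exp(-sum(lam[e] for e in M)) for M in family}
    Z = sum(weight.values())
    Z_e = [sum(w for M, w in weight.items() if e in M) for e in range(len(lam))]
    return Z, Z_e


def marginals(family, lam):
    """Marginals theta^lambda_e = Z_e / Z of the product distribution p^lambda."""
    Z, Z_e = partition_functions(family, lam)
    return [z / Z for z in Z_e]


# lambda = 0 gives the uniform distribution over the three trees.
uniform_marginals = marginals(FAMILY, [0.0, 0.0, 0.0])
# Penalising edge 0 lowers its marginal; the marginals still sum to |V| - 1 = 2.
tilted_marginals = marginals(FAMILY, [2.0, 0.0, 0.0])
```

Since every spanning tree of the triangle has exactly two edges, the marginals sum to 2 for any choice of weights, which matches the equality constraint of the spanning tree polytope.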

The proof of Lemma 2.3 relies on establishing that strong duality holds for the convex program in Figure 1, 
which computes the max-entropy distribution with marginals $\theta$. The dual of this program appears in 
Figure 2. Thus, if $\theta$ is in the interior of $P(\mathcal{M})$, then there is a $\lambda^*$ such that $p^* = p^{\lambda^*}$ and $f_\theta(\lambda^*) = H(p^*)$.

Note that $\lambda^*$ may not be unique. Finally, an important property of the dual objective function is 
that $f_\theta$ does not change if we shift by a vector in the span of the rows of $A^{=}$. This is captured in the following 
lemma whose proof appears in Appendix A.2.
Lemma 2.5 $f_\theta(\lambda) = f_\theta(\lambda + (A^{=})^\top d)$ for any $d$.

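Thus, we can restrict our search for the optimal solution to the set $\{\lambda \in \mathbb{R}^m : A^{=}\lambda = 0\}$. In this set there is 
a unique $\lambda^*$ which achieves the optimal value since the constraints $(A^{=}, b)$ are assumed to form a maximal 
linearly independent set. We refer to this $\lambda^*$ as the unique solution to the dual convex program.

$$
\begin{array}{lll}
\inf & f_\theta(\lambda) \stackrel{\text{def}}{=} \sum_{e \in [m]} \theta_e \lambda_e + \ln \sum_{M \in \mathcal{M}} e^{-\lambda(M)} & \\
\text{s.t.} & & \\
& \forall e \in [m] & \lambda_e \in \mathbb{R}
\end{array}
$$

Figure 2: Dual of the Max-Entropy Program for $(\mathcal{M}, \theta)$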
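
The shift invariance of the dual objective can be checked numerically on a small instance. The sketch below is an illustration under stated assumptions: the family is the set of spanning trees of a triangle (so $A^{=}$ is the single row $(1,1,1)$ with right-hand side 2), and `dual_objective`, `theta` and `lam` are hypothetical names and values chosen for the example.

```python
import math
from itertools import combinations

# Toy family (assumed): spanning trees of a triangle, edges indexed 0, 1, 2.
FAMILY = [frozenset(pair) for pair in combinations(range(3), 2)]


def dual_objective(theta, lam, family=FAMILY):
    """f_theta(lambda) = <theta, lambda> + ln sum_M exp(-lambda(M))."""
    log_Z = math.log(sum(math.exp(-sum(lam[e] for e in M)) for M in family))
    return sum(t * l for t, l in zip(theta, lam)) + log_Z


theta = [0.5, 0.7, 0.8]   # entries sum to 2, so theta satisfies the equality constraint
lam = [0.3, -1.2, 0.4]
shift = 5.0               # adds shift * (1, 1, 1), a multiple of the only row of A^=
shifted_lam = [l + shift for l in lam]

f_original = dual_objective(theta, lam)
f_shifted = dual_objective(theta, shifted_lam)
```

The two values agree up to floating-point error: the linear term grows by `2 * shift` while the log-partition term shrinks by the same amount, because every tree has exactly two edges.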



2.4 Formal Statement of Our Results 

Our first result shows that if one has access to a generalized exact counting oracle then one can indeed 
compute a good approximation to the max-entropy distribution for specified marginals.

Theorem 2.6 There exists an algorithm that, given a maximal set of linearly independent equalities $(A^{=}, b)$ 
and a generalized exact counting oracle for $P(\mathcal{M}) \subseteq \mathbb{R}^m$, a $\theta$ in the $\eta$-interior of $P(\mathcal{M})$ and an $\varepsilon > 0$, 
returns a $\lambda^\circ$ such that

$$f_\theta(\lambda^\circ) \le f_\theta(\lambda^*) + \varepsilon,$$

where $\lambda^*$ is the optimal solution to the dual of the max-entropy convex program for $(\mathcal{M}, \theta)$ from Figure 2. 
Assuming that the generalized exact counting oracle is polynomial in its input parameters, the running time 
of the algorithm is polynomial in $m$, $1/\eta$, $\log(1/\varepsilon)$ and the number of bits needed to represent $\theta$ and $(A^{=}, b)$.

The proof of this theorem follows from an application of the ellipsoid algorithm for minimizing the dual 
convex program. At first glance, it may seem enough to show that $\|\lambda^*\| \le 2^{\mathrm{poly}(m)}$, since the number of 
iterations of the ellipsoid algorithm depends on $\log \|\lambda^*\|$. Unfortunately, this is not enough since each call 
to the oracle with input $\lambda$ takes time polynomial in the number of bits needed to represent $e^{-\lambda_e}$ for any 
$e \in [m]$. We show the following theorem which provides a polynomial bound on $\|\lambda^*\|$.

Theorem 2.7 Let $\theta$ be in the $\eta$-interior of $P(\mathcal{M}) \subseteq \mathbb{R}^m$. Then there exists an optimal solution $\lambda^*$ to the 
dual of the max-entropy convex program such that $\|\lambda^*\| \le m/\eta$.

We specifically note that the proof of this theorem needs that $\lambda^*$ satisfies $A^{=}\lambda^* = 0$. Combinatorially, it is an 
interesting open problem to see if one can get such a bound depending only on $1/\eta$. Next we generalize Theorem 2.6 
to polytopes where only an approximate counting oracle exists, for example, the perfect matching 
problem in bipartite graphs. While we state this theorem in the context of deterministic counting oracles, it 
holds in the randomized setting as well.

Theorem 2.8 There exists an algorithm that, given a maximal set of linearly independent equalities $(A^{=}, b)$, 
a generalized approximate counting oracle for $P(\mathcal{M}) \subseteq \mathbb{R}^m$, a $\theta$ in the $\eta$-interior of $P(\mathcal{M})$ and an $\varepsilon > 0$, returns 
a $\lambda^\circ$ such that

$$f_\theta(\lambda^\circ) \le f_\theta(\lambda^*) + \varepsilon.$$

Here $\lambda^*$ is an optimal solution to the dual of the max-entropy convex program for $(\mathcal{M}, \theta)$. Assuming that 
the generalized approximate counting oracle is polynomial in its input parameters, the running time of the 
algorithm is polynomial in $m$, $1/\eta$, $1/\varepsilon$ and the number of bits needed to represent $\theta$ and $(A^{=}, b)$.
It can be shown that once we have a solution $\lambda^\circ$ to the dual convex program such that

$$f_\theta(\lambda^\circ) \le f_\theta(\lambda^*) + \varepsilon$$

as in Theorems 2.6 and 2.8, the marginals obtained from the distribution corresponding 
to $\lambda^\circ$ are close to those of $\lambda^*$ (which are $\theta$), i.e., $\|\theta^{\lambda^\circ} - \theta\| \le O(\sqrt{\varepsilon})$. See Appendix A.2 and in particular 
Corollary A.5 for a proof.



Remark 2.9 We can also obtain proofs of Theorems 2.6 and 2.8 by applying the framework of projected 
gradient descent. (See Section 3.2.3 in [29] for details on the gradient descent method.) For the gradient 
descent method to be polynomial time, one would need an upper bound on $\|\lambda^*\|$ and $\|\nabla f\|_{2 \to 2}$. The first 
bound is provided by Theorem 2.7 and the second bound is proved in Theorem C.1 in Appendix C. We have 
chosen the ellipsoid method-based proofs of Theorems 2.6 and 2.8 since the ellipsoid method is required in 
the proof of Theorem 2.11.

Our final theorem proves the reverse: if one can compute good approximations to the max-entropy convex 
program for $P(\mathcal{M})$ for a given marginal vector, then one can compute good approximations to the number 
of vertices in $P(\mathcal{M})$. First, we need a notion of a max-entropy oracle for $\mathcal{M}$.

Definition 2.10 An approximate max-entropy oracle for $\mathcal{M}$, given a $\theta$ in the $\eta$-interior of $P(\mathcal{M})$, a $\zeta > 0$, 
and an $\varepsilon > 0$, either

1. asserts that $\inf_\lambda f_\theta(\lambda) \ge \zeta - \varepsilon$ or

2. returns a $\lambda \in \mathbb{R}^m$ such that $f_\theta(\lambda) \le \zeta + \varepsilon$.

The oracle is assumed to be efficient, i.e., it runs in time polynomial in $m$, $1/\varepsilon$, $1/\eta$ and the number of bits 
needed to represent $\zeta$.

This is consistent with the algorithms given by Theorems 2.6 and 2.8.

Theorem 2.11 There exists an algorithm that, given a maximal set of linearly independent equalities $(A^{=}, b)$ 
and a separation oracle and an approximate max-entropy oracle for $\mathcal{M}$ as above, returns a $Z$ such that $(1 - 
\varepsilon)|\mathcal{M}| \le Z \le (1 + \varepsilon)|\mathcal{M}|$. Assuming that the running times of the separation oracle and the approximate 
max-entropy oracle are polynomial in their respective input parameters, the running time of the algorithm 
is bounded by a polynomial in $m$, $1/\varepsilon$ and the number of bits needed to represent $(A^{=}, b)$.

Analogously, one can easily formulate and prove a randomized version of Theorem 2.11; we omit the details. 
As an important corollary of this theorem, if one is able to efficiently find approximate max-entropy 
distributions for the perfect matching polytope for general graphs, then one can approximately count the 
number of perfect matchings they contain. Both problems have long been open and this result, in particular, 
relates their hardness.

Remark 2.12 One may ask if Theorem 2.11 can be strengthened to obtain generalized approximate counting 
oracles from max-entropy oracles. The question is natural since Theorems 2.6 and 2.8 assume access 
to generalized counting oracles. The answer is yes and is provided in Theorem B.2 in Appendix B. It turns 
out that one needs access to a generalized max-entropy oracle, an oracle that can compute the distribution 
that minimizes the KL-divergence with respect to a fixed product distribution and a given set of marginals. 
These latter programs are shown, in Appendix B, to be no more general than max-entropy programs. In fact, 
analogs of Theorems 2.6 and 2.8 can be proved for min-KL-divergence programs rather than max-entropy 
programs, see Theorem B.5.
2.5 The Ellipsoid Algorithm

In this section we review the basics of the ellipsoid algorithm. The ellipsoid algorithm is used in the proofs 
of our equivalence between optimization and counting: both in the proofs of Theorems 2.6 and 2.8 and in 
the proof of Theorem 2.11. Consider the following optimization problem where $g(\cdot)$ is convex and the $h_i(\cdot)$ are 
affine functions.

$$
\begin{array}{lll}
\inf & g(\lambda) & \\
\text{s.t.} & & \\
& \forall\, 1 \le i \le k & h_i(\lambda) = 0
\end{array}
$$

We assume that $g$ is differentiable everywhere and that its gradient, denoted by $\nabla g$, is defined everywhere. 
In our application, for a polytope $P(\mathcal{M})$ and a $\theta$ in the $\eta$-interior of $P(\mathcal{M})$, $g = f_\theta$, the objective function 
in the dual program of Figure 2. The $h_i(\cdot)$s are the constraints $A^{=}\lambda = 0$, where $(A^{=}, c)$ is the maximal set of 
linearly independent equalities satisfied by the vertices of $P(\mathcal{M})$. Thus, as noted in Lemma 2.5, we can restrict 
our search for the optimal solution to the set $K$ which is defined to be

$$K \stackrel{\text{def}}{=} \{\lambda \in \mathbb{R}^m : A^{=}\lambda = 0\}.$$

Note that $0 \in K$. The ellipsoid algorithm can be used to solve such a convex program under fairly general 
conditions and we first state a version of it needed in the proof of Theorem 2.6. A crucial requirement is a 
strong first-order oracle for $g$, which is a function that, given a $\lambda$, outputs $g(\lambda)$ and $\nabla g(\lambda)$. Since we 
are only interested in $\lambda \in K$, and we are given the equalities describing $K$ explicitly, we assume that we can 
project $\nabla g(\lambda)$ to $K$. By abuse of notation, we denote the latter also by $\nabla g(\lambda)$.

The following theorem claims that if one is given access to a strong first-order oracle for g, one can use 
the ellipsoid algorithm to obtain an approximately optimal solution to the convex program mentioned above. 
This statement is easily derivable from O (Theorem 8.2.1). 

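Theorem 2.13 Given any $\beta > 0$ and $R > 0$, there is an algorithm which, given a strong first-order oracle 
for $g$, returns a point $\lambda' \in \mathbb{R}^m$ such that

$$g(\lambda') \le \inf_{\lambda \in K,\ \|\lambda\|_\infty \le R} g(\lambda) + \beta \left( \sup_{\lambda \in K,\ \|\lambda\|_\infty \le R} g(\lambda) - \inf_{\lambda \in K,\ \|\lambda\|_\infty \le R} g(\lambda) \right).$$

The number of calls to the strong first-order oracle for $g$ is bounded by a polynomial in $m$, $\log R$ and $\log(1/\beta)$.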

While we do not explicitly describe the ellipsoid algorithm here, we need the following basic properties 
of minimum volume enclosing ellipsoids, which form the basis of the ellipsoid algorithm. A set $E \subseteq \mathbb{R}^m$ 
is an ellipsoid if there exists a vector $a \in \mathbb{R}^m$ and a positive definite $m \times m$ matrix $A$ such that

$$E = E(A, a) \stackrel{\text{def}}{=} \{x \in \mathbb{R}^m : (x - a)^\top A^{-1} (x - a) \le 1\}.$$

We also denote by $\mathrm{Vol}(E)$ the volume enclosed by the ellipsoid $E$. 
The following theorem follows from the Lowner-John Ellipsoid. We refer the reader to [13] for more details.

Theorem 2.14 Given an ellipsoid $E(A,a)$ and a half-space $\{x : \langle c, x \rangle \le \langle c, a \rangle\}$ passing through $a$, there 
exists an ellipsoid $E' \supseteq E(A,a) \cap \{x : \langle c, x \rangle \le \langle c, a \rangle\}$ such that

$$\frac{\mathrm{Vol}(E')}{\mathrm{Vol}(E)} \le e^{-\frac{1}{2m}}.$$

In our applications of this theorem, we in fact need the ellipsoid to be in an affine space of dimension 
possibly lower than $m$. The definitions and the theorem continue to hold under such a setting.
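
To make Theorem 2.14 concrete, here is a minimal sketch of the standard central-cut update that produces such an ellipsoid $E'$. The update formulas are the textbook ones for dimension $n \ge 2$, not taken from this paper, and the function name `central_cut_update` and the example inputs are assumptions for illustration.

```python
import math


def central_cut_update(A, a, c):
    """One central-cut step: return (A_new, a_new) describing an ellipsoid that
    contains E(A, a) intersected with {x : <c, x> <= <c, a>}.

    A is a positive definite matrix (list of rows), a the centre, c the cut
    direction. Uses the textbook update, valid for dimension n >= 2.
    """
    n = len(a)
    Ac = [sum(A[i][j] * c[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(c[i] * Ac[i] for i in range(n)))  # sqrt(c^T A c)
    b = [v / norm for v in Ac]
    a_new = [a[i] - b[i] / (n + 1) for i in range(n)]
    scale = n * n / (n * n - 1.0)
    A_new = [
        [scale * (A[i][j] - 2.0 / (n + 1) * b[i] * b[j]) for j in range(n)]
        for i in range(n)
    ]
    return A_new, a_new


# Unit disk in the plane, cut by the half-plane x_0 <= 0.
A1, a1 = central_cut_update([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 0.0])
# Volume scales with sqrt(det A); the starting ellipsoid has det = 1.
volume_ratio = math.sqrt(A1[0][0] * A1[1][1] - A1[0][1] * A1[1][0])
```

In this two-dimensional example the volume ratio is $\sqrt{16/27} \approx 0.77$, which is below the bound $e^{-1/(2m)} = e^{-1/4}$ of the theorem.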
3 Examples of Combinatorial Polytopes

If one wishes, one can keep the following combinatorial polytopes in mind while trying to understand and 
interpret the results of this paper. 

The Spanning Tree Polytope. Given a graph $G = (V,E)$, let

$$\mathcal{M} \stackrel{\text{def}}{=} \{1_T \in \mathbb{R}^{|E|} : T \subseteq E \text{ is a spanning tree of } G\}.$$

It follows from a result of Edmonds [9] that

$$P(\mathcal{M}) = \{x \in \mathbb{R}^{|E|}_{\ge 0} : x(E(V)) = |V| - 1,\ x(E(S)) \le |S| - 1 \ \forall S \subseteq V\}$$

where, for $S \subseteq V$, $E(S) \stackrel{\text{def}}{=} \{e = \{u,v\} \in E : \{u,v\} \cap S = \{u,v\}\}$ and, for a subset of edges $H \subseteq E$, $x(H) \stackrel{\text{def}}{=} 
\sum_{e \in H} x_e$. Edmonds [8] also shows the existence of a separation oracle for this polytope. A generalized exact 
counting oracle is known for this spanning tree polytope via Kirchhoff's matrix-tree theorem, see [14].
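
The following sketch shows one way a generalized exact counting oracle for spanning trees could be realised with the weighted matrix-tree theorem: with edge weights $w_e = e^{-\lambda_e}$, a cofactor of the weighted Laplacian equals $Z^\lambda$. The function names, the floating-point determinant, and the trick of obtaining $Z^\lambda_e$ by zeroing out one edge weight are assumptions chosen for brevity, not the paper's implementation; a careful implementation would use exact or better-conditioned arithmetic.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
    return result


def tree_partition_function(num_vertices, edges, weights):
    """Sum over spanning trees T of prod_{e in T} weights[e] (matrix-tree theorem)."""
    L = [[0.0] * num_vertices for _ in range(num_vertices)]
    for (u, v), w in zip(edges, weights):
        L[u][u] += w
        L[v][v] += w
        L[u][v] -= w
        L[v][u] -= w
    reduced = [row[1:] for row in L[1:]]  # delete row and column of vertex 0
    return det(reduced)


def generalized_counts(num_vertices, edges, lam):
    """Return (Z, [Z_e]) where Z_e = Z minus the weight of trees avoiding e."""
    w = [math.exp(-l) for l in lam]
    Z = tree_partition_function(num_vertices, edges, w)
    Z_e = []
    for i in range(len(edges)):
        w_without = w[:i] + [0.0] + w[i + 1:]
        Z_e.append(Z - tree_partition_function(num_vertices, edges, w_without))
    return Z, Z_e


K4_EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
Z, Z_e = generalized_counts(4, K4_EDGES, [0.0] * 6)
```

With $\lambda = 0$ on the complete graph $K_4$, this recovers the 16 spanning trees given by Cayley's formula, and each edge lies in 8 of them.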

The Perfect Matching Polytope for Bipartite Graphs. Given a bipartite graph $G = (V,E)$, let

$$\mathcal{M} \stackrel{\text{def}}{=} \{1_M \in \mathbb{R}^{|E|} : M \text{ is a perfect matching in } G\}.$$
It follows from a theorem of Birkhoff |[T3l that, when G is bipartite, 

$$P(\mathcal{M}) = \{x \in \mathbb{R}^{|E|}_{\ge 0} : x(\delta(v)) = 1 \ \forall v \in V\}$$

where, for $v \in V$, $\delta(v) \stackrel{\text{def}}{=} \{e = \{u,v\} \in E\}$. Here, it can be shown that all the facets, i.e., the defining 
inequalities, are among the $2m$ inequalities $0 \le x_e \le 1$ for all $e \in [m]$. The exact counting problem is 
#P-hard, while a (randomized) generalized approximate counting oracle follows from a result of Jerrum, 
Sinclair and Vigoda [21] for computing permanents.

The Cycle Cover Polytope for Directed Graphs. Given a directed graph $G = (V,A)$, let

$$\mathcal{M} \stackrel{\text{def}}{=} \{1_M \in \mathbb{R}^{|A|} : M \text{ is a cycle cover in } G\}.$$

A cycle cover in $G$ is a collection of vertex disjoint directed cycles that cover all the vertices of $G$. The 
corresponding cycle cover polytope is denoted by $P(\mathcal{M})$. This polytope is easily seen to be a special case of 
the perfect matching polytope for bipartite graphs as follows. For $G = (V,A)$, construct a bipartite graph $H = 
(V_L, V_R, E)$ where $V_L = V_R = V$. For each vertex $v \in V$ we have $v_L \in V_L$ and $v_R \in V_R$. There is an edge between 
$u_L \in V_L$ and $v_R \in V_R$ in $H$ if and only if $(u, v) \in A$. Thus, there is a one-to-one correspondence between cycle 
covers in $G$ and perfect matchings in $H$. Hence, the algorithm of [21] gives a generalized approximate counting 
oracle in this case as well.
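
The correspondence between cycle covers and perfect matchings can be sanity-checked by brute force on a small digraph. The sketch below counts cycle covers directly (arc subsets in which every vertex has in-degree and out-degree one) and counts perfect matchings of the bipartite graph $H$ (bijections $u_L \mapsto v_R$ using only arcs of $G$). The function names and the example graph are assumptions for illustration, and the enumeration is exponential, so it is only meant for tiny inputs.

```python
from itertools import combinations, permutations


def count_cycle_covers(n, arcs):
    """Count arc subsets of size n in which every vertex has in- and out-degree 1."""
    count = 0
    for subset in combinations(arcs, n):
        out_deg = [0] * n
        in_deg = [0] * n
        for u, v in subset:
            out_deg[u] += 1
            in_deg[v] += 1
        if all(d == 1 for d in out_deg) and all(d == 1 for d in in_deg):
            count += 1
    return count


def count_perfect_matchings(n, arcs):
    """Count perfect matchings of H, where u_L is adjacent to v_R iff (u, v) is an arc."""
    arc_set = set(arcs)
    return sum(
        all((u, sigma[u]) in arc_set for u in range(n))
        for sigma in permutations(range(n))
    )


# Complete digraph on 4 vertices without self-loops.
COMPLETE_4 = [(u, v) for u in range(4) for v in range(4) if u != v]
covers = count_cycle_covers(4, COMPLETE_4)
matchings = count_perfect_matchings(4, COMPLETE_4)
```

For the loop-free complete digraph on 4 vertices both counts equal 9, the number of derangements of 4 elements.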

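The Perfect Matching Polytope for General Graphs. Given a graph $G = (V,E)$, let

$$\mathcal{M} \stackrel{\text{def}}{=} \{1_M \in \mathbb{R}^{|E|} : M \text{ is a perfect matching in } G\}.$$

A celebrated result of Edmonds [7] states that

$$P(\mathcal{M}) = \left\{x \in \mathbb{R}^{|E|}_{\ge 0} : x(\delta(v)) = 1 \ \forall v \in V,\ x(E(S)) \le \frac{|S| - 1}{2} \ \forall S \subseteq V,\ |S| \text{ odd}\right\}.$$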

The separation oracle for this polytope is non-trivial and follows from the characterization result of Ed- 
monds. A direct separation oracle was also given by Padberg and Rao 0T1 . Coming up with a counting 
oracle for this polytope, even with uniform weights which counts the number of perfect matchings in a 
general graph, is a long-standing open problem. 
4 New Algorithmic Approaches for the Traveling Salesman Problem

Max-entropy distributions over spanning trees have been successfully applied to obtain improved algorithms 
for the symmetric QUI as well as the asymmetric traveling salesman problem [1 ]. We outline here a different 
algorithmic approach, using max-entropy distributions over cycle covers, which becomes computationally 
feasible as a consequence of our results. Let us consider the asymmetric traveling salesman problem (ATSP). 
We are given a complete directed graph $G = (V,E)$ and cost function $c : E \to \mathbb{R}_{\ge 0}$ which satisfies the directed 
triangle inequality. The goal is to find a Hamiltonian cycle of smallest cost. First, we formulate the 
subtour elimination linear program in Figure 3.

$$
\begin{array}{lll}
\min & \sum_{e \in E} c_e x_e & \\
\text{s.t.} & & \\
& \forall v \in V & x(\delta^+(v)) = x(\delta^-(v)) = 1 \\
& \forall S \subset V & x(\delta^+(S)) \ge 1 \\
& \forall e \in E & 0 \le x_e \le 1
\end{array}
$$

Figure 3: Subtour Elimination LP for $G = (V,E)$ and $c$

Here, for a vertex $v$, $\delta^+(v)$ is the set of directed edges going out of it and $\delta^-(v)$ is the set of directed 
edges coming in to $v$. Let $x^*$ denote the optimal solution to this linear program. The authors of [1] make the 
observation that $\theta_{uv} = \frac{n-1}{n}(x^*_{uv} + x^*_{vu})$, with $n = |V|$, defined on the undirected edges is a point in the interior of the spanning 
tree polytope on $G$. The algorithm then samples a spanning tree $T$ from the max-entropy distribution with 
marginals as given by $\theta$ and crucially relies on properties of such a $T$ to obtain an $O\left(\frac{\log n}{\log \log n}\right)$-approximation 
algorithm for the ATSP problem.

Interestingly, there is another integral polytope in which $x^*$ is contained. Consider the convex hull $P$ of 
all cycle covers of $G$, see Section 3. Then,

$$P = \{x \in \mathbb{R}^{|E|}_{\ge 0} : x(\delta^+(v)) = x(\delta^-(v)) = 1 \ \forall v \in V\}.$$

It is easy to see that $x^* \in P$. Similar to the cycle cover algorithm of Frieze et al. [12], the following is a 
natural algorithm for the ATSP problem.

Randomized Cycle Cover Algorithm

1. Initialize $H \leftarrow \emptyset$.

2. While $G$ is not a single vertex

• Solve the subtour elimination LP for $G$ to obtain the solution $x^*$.

• Sample a cycle cover $\mathcal{C}$ from the max-entropy distribution with marginals $x^*$.

• Include in $H$ all edges in $\mathcal{C}$, i.e., $H \leftarrow H \cup \left(\bigcup_{C \in \mathcal{C}} C\right)$.

• Select one representative vertex $v_C$ in each cycle $C \in \mathcal{C}$ and delete all the other vertices.

3. Return $H$.

Before analyzing the performance of this algorithm, a basic question is whether this algorithm can be 
implemented in polynomial time. As an application of Theorem 2.8 to the cycle cover polytope for directed 
graphs, it follows that one can sample a cycle cover from the max-entropy distribution in polynomial time 
and, thus, the question is answered affirmatively. The generalized (randomized) approximate counting 
oracle for cycle covers in a graph follows from the work of [21]. The technical condition of interiority of $x^*$ 
can be satisfied with a slight loss in optimality of the objective function. The analysis of the worst-case 
performance of this algorithm is left open, but to the best of our knowledge, there is no example ruling out that the 
Randomized Cycle Cover Algorithm is an $O(1)$-approximation. Similarly, the application of Theorem 2.8 
to the perfect matching polytope in bipartite graphs makes the permanent-based approach suggested in [36] 
for the (symmetric) TSP computationally feasible.

5 Bounding Box 

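In this section, we prove Theorem 2.7 and show that there is a bounding box of small radius containing the 
optimal solution $\lambda^*$. We begin with the following lemma.

Lemma 5.1 Let $\theta$ be a point in the $\eta$-interior of $P(\mathcal{M}) \subseteq \mathbb{R}^m$ and let $\lambda^*$ be the optimal solution to the dual 
convex program. Then for any $x \in P(\mathcal{M})$,

$$\langle \lambda^*, \theta - x \rangle \le m.$$

Proof: First, note that the supremum of the primal convex program over all $\theta$ is $\ln |\mathcal{M}| \le m$. Hence, from 
strong duality it follows that $f_\theta(\lambda^*) \le m$. This implies that

$$f_\theta(\lambda^*) = \langle \lambda^*, \theta \rangle + \ln \sum_{M \in \mathcal{M}} e^{-\lambda^*(M)} = \ln \sum_{M \in \mathcal{M}} e^{\langle \lambda^*, \theta \rangle - \lambda^*(M)} \le m.$$

Hence, for every $M \in \mathcal{M}$,

$$\langle \lambda^*, \theta \rangle - \lambda^*(M) \le m. \qquad (3)$$

Since $x \in P(\mathcal{M})$, we have $x = \sum_{M \in \mathcal{M}} r_M 1_M$ where $\sum_{M \in \mathcal{M}} r_M = 1$ and $r_M \ge 0$ for each $M \in \mathcal{M}$. Multiplying 
(3) by $r_M$ and summing over $M$ we get

$$\sum_{M \in \mathcal{M}} r_M \left(\langle \lambda^*, \theta \rangle - \lambda^*(M)\right) \le \sum_{M \in \mathcal{M}} r_M m.$$

This implies that

$$\langle \lambda^*, \theta \rangle - \sum_{M \in \mathcal{M}} r_M \lambda^*(M) \le m.$$

As a consequence, we obtain that $\langle \lambda^*, \theta \rangle - \langle \lambda^*, x \rangle \le m$, completing the proof of the lemma. ■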

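Proof of Theorem 2.7: Recall that $A^{=}x = c$ denotes the maximal set of independent equalities satisfied by 
$P(\mathcal{M})$. We now define the following objects. Let

$$B \stackrel{\text{def}}{=} \{x \in \mathbb{R}^m : A^{=}x = c,\ \|x - \theta\| \le \eta\}$$

be the ball of radius $\eta$ centered around $\theta$ restricted to the affine space $A^{=}x = c$. Since $\theta$ is in the $\eta$-interior 
of $P(\mathcal{M})$, we have that $B \subseteq P(\mathcal{M})$. Let

$$\tilde{Q} \stackrel{\text{def}}{=} \{y \in \mathbb{R}^m : A^{=}y = c,\ \|y - \theta\| \le 1/\eta\}$$

be the ball centered around $\theta$ of radius $1/\eta$ in the same affine space and let

$$Q \stackrel{\text{def}}{=} \{z \in \mathbb{R}^m : A^{=}z = c,\ \langle z - \theta, x - \theta \rangle \le 1 \ \forall x \in B\}.$$

Lemma 5.2 $\tilde{Q} = Q$.

Proof: We first prove that $\tilde{Q} \subseteq Q$. Let $y \in \tilde{Q}$. The constraints $A^{=}y = c$ are clearly satisfied since $y \in \tilde{Q}$. For 
any $x \in B$,

$$\langle y - \theta, x - \theta \rangle \le \|y - \theta\| \, \|x - \theta\| \le \frac{1}{\eta} \cdot \eta = 1.$$

Thus, $y \in Q$. Now we show that $Q \subseteq \tilde{Q}$. Let $z \in Q$ with $z \ne \theta$ (the case $z = \theta$ is trivial). The constraints $A^{=}z = c$ are clearly satisfied since $z \in Q$. 
Now consider

$$z' \stackrel{\text{def}}{=} \theta + \eta \frac{z - \theta}{\|z - \theta\|}.$$

We have that

$$A^{=}z' = A^{=}\theta + \frac{\eta}{\|z - \theta\|}\left(A^{=}z - A^{=}\theta\right) = c.$$

Moreover,

$$\|z' - \theta\| = \left\| \theta + \eta \frac{z - \theta}{\|z - \theta\|} - \theta \right\| = \eta.$$

Thus, $z' \in B$. Hence, we must have $\langle z - \theta, z' - \theta \rangle \le 1$. This implies that

$$\left\langle z - \theta,\ \eta \frac{z - \theta}{\|z - \theta\|} \right\rangle \le 1$$

and, therefore,

$$\|z - \theta\| \le \frac{1}{\eta}.$$

Thus, $z \in \tilde{Q}$, completing the proof. ■

We now show that

$$\tilde{\lambda} \stackrel{\text{def}}{=} -\lambda^*/m + \theta \in Q.$$

To see this, first observe that

$$A^{=}\tilde{\lambda} = -A^{=}\lambda^*/m + A^{=}\theta = 0 + c.$$

Here we have used the fact that $A^{=}\lambda^* = 0$, see Lemma 2.5. We now verify the second condition. Let $x \in B$. 
Then

$$\langle \tilde{\lambda} - \theta, x - \theta \rangle = \left\langle -\frac{\lambda^*}{m}, x - \theta \right\rangle = \frac{\langle \lambda^*, \theta - x \rangle}{m} \le \frac{m}{m} = 1,$$

where the last inequality follows from the fact that $x \in B \subseteq P(\mathcal{M})$ and Lemma 5.1. Thus, $\tilde{\lambda} \in Q$ and 
therefore, by Lemma 5.2, we must have $\|\tilde{\lambda} - \theta\| \le 1/\eta$. Therefore, $\|\lambda^*/m\| \le 1/\eta$, proving Theorem 2.7. ■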

6 Optimization via Counting 



In this section, we prove Theorems 2.6 and 2.8. The proofs of both theorems rely on the bounding box 
result of Theorem 2.7 and employ the framework of the ellipsoid algorithm from Section 2.5.

6.1 Proof of Theorem 2.6

We first use Theorem 2.13 to give a proof of Theorem 2.6. The algorithm assumes access to a strong first-order 
oracle for $f_\theta$. We then present details of how to implement a strong first-order oracle using a 
generalized exact counting oracle. Suppose $\lambda^*$ is the optimum of our convex program. Theorem 2.7 implies 
that $\|\lambda^*\|_\infty \le m/\eta$. Thus, we may pick the bounding radius to be $R \stackrel{\text{def}}{=} m/\eta$ and it does not cut off the optimal $\lambda^*$ 
we are looking for. The only thing left to choose is a $\beta$ such that

$$\beta \le \frac{\varepsilon}{\sup_{\lambda \in K,\ \|\lambda\|_\infty \le R} f_\theta(\lambda) - \inf_{\lambda \in K,\ \|\lambda\|_\infty \le R} f_\theta(\lambda)}.$$

This would imply that the solution $\lambda^\circ$ output by employing the ellipsoid method from Theorem 2.13 is such 
that

$$f_\theta(\lambda^\circ) \le f_\theta(\lambda^*) + \varepsilon.$$

To establish a bound on $\beta$, start by noticing that $\inf_\lambda f_\theta(\lambda) \ge 0$. This follows from weak duality and the fact 
that entropy is always non-negative. On the other hand we have the following simple lemma.

Lemma 6.1 $\sup_{\|\lambda\|_\infty \le R} f_\theta(\lambda) \le (2m + 1)R$.

Proof:

$$f_\theta(\lambda) \le |\langle \lambda, \theta \rangle| + \left| \ln \sum_{M \in \mathcal{M}} e^{-\lambda(M)} \right| \le mR + \ln\left(2^m e^{mR}\right) \le (2m + 1)R.$$

Here we have used the fact that $|\mathcal{M}| \le 2^m$ and that $\theta_e \in [0, 1]$ for each $e \in [m]$. ■

Thus, $\beta$ can be chosen to be $\frac{\varepsilon}{(2m+1)R}$. Hence, the running time of the ellipsoid method depends polynomially 
on the time it takes to implement the strong first-order oracle for $f_\theta$ and on $\log \frac{1}{\beta}$. 
Since $f_\theta(\lambda) = \langle \lambda, \theta \rangle + \ln \sum_{M \in \mathcal{M}} e^{-\lambda(M)} = \langle \lambda, \theta \rangle + \ln Z^\lambda$, it is easily seen that

$$\nabla f_\theta(\lambda)_e = \theta_e - \frac{\sum_{M \in \mathcal{M},\, M \ni e} e^{-\lambda(M)}}{\sum_{N \in \mathcal{M}} e^{-\lambda(N)}} = \theta_e - \frac{Z^\lambda_e}{Z^\lambda} = \theta_e - \theta^\lambda_e.$$

Hence, $\nabla f_\theta(\lambda) = \theta - \theta^\lambda$. Recall that the strong first-order oracle for $f_\theta$ requires, for a given $\lambda$, $f_\theta(\lambda)$ 
and $\nabla f_\theta(\lambda)$. The generalized exact counting oracle for $P(\mathcal{M})$ immediately does it as it gives us $Z^\lambda_e$ for all 
$e \in [m]$ and $Z^\lambda$. This allows us to compute $f_\theta(\lambda)$ and $\nabla f_\theta(\lambda)$ in one call to such an oracle. In addition we 
also need time proportional to the number of bits needed to represent $\theta$.

Thus, the number of calls to the counting oracle by the ellipsoid algorithm of Theorem 2.13 is bounded 
by a polynomial in $m$, $\log R$ and $\log(1/\varepsilon)$. Since each oracle call can be implemented in time polynomial in $m$ 
and $R$, this gives the required running time and concludes the proof of Theorem 2.6.
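
The gradient formula $\nabla f_\theta(\lambda) = \theta - \theta^\lambda$ can be checked against central finite differences on a toy instance. The sketch below is illustrative only: the family (spanning trees of a triangle), the enumeration standing in for a counting oracle, and all names are assumptions, not the paper's implementation.

```python
import math
from itertools import combinations

# Toy family (assumed): spanning trees of a triangle, edges indexed 0, 1, 2.
FAMILY = [frozenset(pair) for pair in combinations(range(3), 2)]


def dual_objective(theta, lam):
    """f_theta(lambda) = <lambda, theta> + ln Z^lambda."""
    Z = sum(math.exp(-sum(lam[e] for e in M)) for M in FAMILY)
    return sum(t * l for t, l in zip(theta, lam)) + math.log(Z)


def dual_gradient(theta, lam):
    """Analytic gradient: theta_e - Z_e / Z for each coordinate e."""
    weight = {M: math.exp(-sum(lam[e] for e in M)) for M in FAMILY}
    Z = sum(weight.values())
    return [
        theta[e] - sum(w for M, w in weight.items() if e in M) / Z
        for e in range(len(lam))
    ]


def finite_difference_gradient(theta, lam, h=1e-6):
    """Central finite differences, used only as a numerical cross-check."""
    grad = []
    for e in range(len(lam)):
        up = list(lam)
        down = list(lam)
        up[e] += h
        down[e] -= h
        grad.append((dual_objective(theta, up) - dual_objective(theta, down)) / (2 * h))
    return grad


theta = [0.5, 0.7, 0.8]
lam = [0.3, -1.2, 0.4]
analytic = dual_gradient(theta, lam)
numeric = finite_difference_gradient(theta, lam)
max_error = max(abs(a - b) for a, b in zip(analytic, numeric))
```

The gradient vanishes exactly when the marginals $\theta^\lambda$ match the target $\theta$, which is the optimality condition of the dual program; for instance, at $\lambda = 0$ with $\theta$ equal to the uniform marginals $(2/3, 2/3, 2/3)$.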

6.2 Proof of Theorem 2.8

Now we give the ellipsoid algorithm that works with a generalized approximate counting oracle and prove 
Theorem 2.8. Here, the fact that the counting oracle is approximate means that the gradient computed as 
in the previous section is approximate. Thus, this raises the possibility of cutting off the optimal $\lambda^*$ during 
the run of the ellipsoid algorithm. We present the ellipsoid algorithm to check, given a $\theta$ and a $\zeta$, whether 
$|f_\theta(\lambda^*) - \zeta| \le \varepsilon$. The technical heart of the matter is to show that when $|f_\theta(\lambda^*) - \zeta| \le \varepsilon$, $\lambda^*$ is never cut 
off of the successive ellipsoids obtained by adding the approximate gradient constraints. Moreover, in this 
case, once the radius of the ellipsoid becomes small enough, we can output its center as a guess for $\lambda^*$. Since 
the radius of the final ellipsoid is small and it contains $\lambda^*$, the following lemma, which bounds the Lipschitz 
constant of $f_\theta$, implies that the value of $f_\theta$ at the center of the ellipsoid is close enough to $f_\theta(\lambda^*)$.

Here we ignore the fact that $e^{-\lambda_e}$ can be irrational. This issue can be dealt with in a standard manner as is done in the implementation 
details of all ellipsoid algorithms. See [15] for details.

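Lemma 6.2 For any $\lambda, \lambda'$,

$$f_\theta(\lambda) - f_\theta(\lambda') \le 2\sqrt{m}\, \|\lambda - \lambda'\|.$$

Proof: We have

$$
\begin{array}{rcl}
f_\theta(\lambda) - f_\theta(\lambda') & = & \langle \theta, \lambda - \lambda' \rangle + \ln \frac{\sum_{M \in \mathcal{M}} e^{-\lambda(M)}}{\sum_{M \in \mathcal{M}} e^{-\lambda'(M)}} \\
& \le & \|\theta\| \, \|\lambda - \lambda'\| + \ln \max_{M \in \mathcal{M}} \frac{e^{-\lambda(M)}}{e^{-\lambda'(M)}} \\
& \le & \sqrt{m}\, \|\lambda - \lambda'\| + \max_{M \in \mathcal{M}} \left(\lambda'(M) - \lambda(M)\right) \\
& \le & \sqrt{m}\, \|\lambda - \lambda'\| + \sqrt{m}\, \|\lambda - \lambda'\| \le 2\sqrt{m}\, \|\lambda - \lambda'\|
\end{array}
$$

which completes the proof. Here we have used the Cauchy-Schwarz inequality in the first and third inequalities 
and in the second inequality we have used the fact that $\theta \in [0, 1]^m$. ■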
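
As a quick numerical sanity check of the Lipschitz bound in Lemma 6.2, the sketch below samples random pairs $\lambda, \lambda'$ on a toy instance and records the slack $2\sqrt{m}\,\|\lambda - \lambda'\| - (f_\theta(\lambda) - f_\theta(\lambda'))$. The family, the sampling range, the seed and all names are assumptions for illustration; a random test of this kind supports but does not prove the lemma.

```python
import math
import random
from itertools import combinations

# Toy family (assumed): spanning trees of a triangle, edges indexed 0, 1, 2.
FAMILY = [frozenset(pair) for pair in combinations(range(3), 2)]


def dual_objective(theta, lam):
    """f_theta(lambda) = <theta, lambda> + ln sum_M exp(-lambda(M))."""
    Z = sum(math.exp(-sum(lam[e] for e in M)) for M in FAMILY)
    return sum(t * l for t, l in zip(theta, lam)) + math.log(Z)


def lipschitz_slack(theta, lam_a, lam_b):
    """2*sqrt(m)*||lam_a - lam_b||_2 minus (f(lam_a) - f(lam_b)); Lemma 6.2 says >= 0."""
    m = len(theta)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(lam_a, lam_b)))
    return 2 * math.sqrt(m) * dist - (
        dual_objective(theta, lam_a) - dual_objective(theta, lam_b)
    )


rng = random.Random(0)          # fixed seed so the run is reproducible
theta = [0.5, 0.7, 0.8]         # a point with entries in [0, 1]
slacks = []
for _ in range(1000):
    lam_a = [rng.uniform(-5.0, 5.0) for _ in range(3)]
    lam_b = [rng.uniform(-5.0, 5.0) for _ in range(3)]
    slacks.append(lipschitz_slack(theta, lam_a, lam_b))
worst_slack = min(slacks)
```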



Proceeding to the ellipsoid algorithm underlying the proof of Theorem 2.8, we do a binary search on the 
optimal value $f_\theta(\lambda^*)$ up to an accuracy of $\varepsilon/8$. For a guess $\zeta \in (0, m]$, we check whether the guess is correct 
with the following ellipsoid algorithm. (For ease of analysis, we assume that all oracle calls are answered correctly with probability 1. The failure probability can be 
adjusted to arbitrary precision with a slight degradation in the running time.)

1. Input

(a) An error parameter $\varepsilon > 0$.

(b) An interiority parameter $\eta > 0$.

(c) A $\theta$ which is guaranteed to be in the $\eta$-interior of $P(\mathcal{M})$.

(d) A maximally linearly independent set of equalities $(A^{=}, b)$ for $P(\mathcal{M})$.

(e) A generalized approximate counting oracle for $\mathcal{M}$.

(f) A guess $\zeta \in (0, m]$ (for $f_\theta(\lambda^*)$).

2. Initialization

(a) Let $E_0 = E(B_0, c_0)$ be a sphere with radius $R \stackrel{\text{def}}{=} m/\eta$ centered around the origin (thus, containing 
$\lambda^*$ by Theorem 2.7) and restricted to the affine space $A^{=}x = 0$.

(b) Set $t = 0$.

3. Repeat until the ellipsoid $E_t$ is contained in a ball of radius at most $\frac{\varepsilon}{16\sqrt{m}}$.

(a) Given the ellipsoid $E_t = E(B_t, c_t)$, set $\lambda_t = c_t$.

(b) Compute $\zeta_t$ using the counting oracle such that $f_\theta(\lambda_t) - \varepsilon/8 \le \zeta_t \le f_\theta(\lambda_t) + \varepsilon/8$.

(c) If $\zeta_t \le \zeta$

i. then return $\lambda_t$ and stop.

ii. else

A. Compute $\theta_t$ such that $\|\theta_t - \theta^{\lambda_t}\|_1 \le \frac{\varepsilon}{16R}$ using the counting oracle.

B. Compute the ellipsoid $E_{t+1}$ to be the smallest ellipsoid containing the half-ellipsoid 
$\{\lambda \in E_t : \langle \lambda - \lambda_t, \theta - \theta_t \rangle \le 0\}$ restricted to the affine space $A^{=}x = 0$.

iii. $t = t + 1$.

4. Let $T = t$ and compute $\zeta_T$ using the counting oracle such that $f_\theta(\lambda_T) - \varepsilon/8 \le \zeta_T \le f_\theta(\lambda_T) + \varepsilon/8$.

5. If $\zeta_T \le \zeta$

(a) then return $\lambda_T$ and stop.

(b) else return $f_\theta(\lambda^*) > \zeta$ and stop ($\zeta$ is not a good guess for $f_\theta(\lambda^*)$).

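We first show that the algorithm can be implemented using a polynomial number of queries to the approximate 
oracle. Steps (3b), (3(c)iiA) and (4) can be computed using oracle calls to the generalized approximate 
counting oracle for $\mathcal{M}$ to obtain $\tilde{Z}^{\lambda_t}$ such that $(1 - \varepsilon/16) Z^{\lambda_t} \le \tilde{Z}^{\lambda_t} \le (1 + \varepsilon/16) Z^{\lambda_t}$. We set $\zeta_t = \langle \lambda_t, \theta \rangle + \ln \tilde{Z}^{\lambda_t}$. 
A simple calculation then shows that

$$f_\theta(\lambda_t) - \frac{\varepsilon}{8} \le \zeta_t \le f_\theta(\lambda_t) + \frac{\varepsilon}{8}$$

since $f_\theta(\lambda_t) = \langle \lambda_t, \theta \rangle + \ln Z^{\lambda_t}$. Similarly using one oracle call to the counting oracle with error parameter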
-jl^r, we can compute 

$$\|\theta_t - \theta^{\lambda_t}\|_1 \le \frac{\varepsilon}{16R}$$

as needed in Step (3(c)iiA) of the algorithm. Using Theorem 2.14, the number of iterations can be bounded 
by a polynomial in $m$ and $\log(R/\varepsilon)$. The analysis is quite standard and omitted. Each of the oracle calls can be 
implemented in time polynomial in $m$, $R$ and $1/\varepsilon$. We now show the following lemma which completes the 
proof of Theorem 2.8.

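Lemma 6.3 Let $\zeta^\circ$ be the smallest guess for which the ellipsoid algorithm succeeds in finding a solution 
and let $\lambda^\circ$ denote the corresponding solution returned. Then, $f_\theta(\lambda^\circ) \le f_\theta(\lambda^*) + \varepsilon$.

Proof: Observe that we have

$$f_\theta(\lambda^\circ) - \frac{\varepsilon}{8} \le \zeta^\circ \le f_\theta(\lambda^\circ) + \frac{\varepsilon}{8}.$$

Since $\zeta^\circ$ is the smallest guess for which the algorithm succeeds in returning an answer, it fails for some 
$\zeta \in [\zeta^\circ - \varepsilon/8, \zeta^\circ]$. We show that $f_\theta(\lambda^*) \ge \zeta - \varepsilon/4$. This suffices to prove the theorem since

$$f_\theta(\lambda^\circ) \le \zeta^\circ + \frac{\varepsilon}{8} \le \zeta + \frac{\varepsilon}{4} \le f_\theta(\lambda^*) + \frac{\varepsilon}{2}.$$

Suppose for the sake of contradiction that

$$f_\theta(\lambda^*) < \zeta - \frac{\varepsilon}{4}. \qquad (4)$$

We then show that $\lambda^*$ must be in the final ellipsoid $E_T$ when the ellipsoid algorithm is run with guess $\zeta$. Let 
$\lambda_t$ be the center of the ellipsoid $E_t$ in any iteration with guess $\zeta$. Let $\zeta_t$ be computed in Step (3b) such that

$$f_\theta(\lambda_t) - \frac{\varepsilon}{8} \le \zeta_t \le f_\theta(\lambda_t) + \frac{\varepsilon}{8}.$$

Since the algorithm does not return any answer with guess $\zeta$, we must have $\zeta_t > \zeta$. Thus,

$$f_\theta(\lambda_t) \ge \zeta_t - \frac{\varepsilon}{8} > \zeta - \frac{\varepsilon}{8} > f_\theta(\lambda^*) + \frac{\varepsilon}{8}$$

where the last inequality follows from inequality (4). Let $\theta_t$ be computed in Step (3(c)iiA) of the algorithm 
such that $\|\theta_t - \theta^{\lambda_t}\|_1 \le \frac{\varepsilon}{32R}$. But then

$$
\begin{array}{rcl}
\langle \lambda^* - \lambda_t, \theta - \theta_t \rangle & = & \langle \lambda^* - \lambda_t, \theta - \theta^{\lambda_t} \rangle + \langle \lambda^* - \lambda_t, \theta^{\lambda_t} - \theta_t \rangle \\
& \le & f_\theta(\lambda^*) - f_\theta(\lambda_t) + \langle \lambda^* - \lambda_t, \theta^{\lambda_t} - \theta_t \rangle \\
& \le & -\frac{\varepsilon}{8} + \|\lambda^* - \lambda_t\|_\infty \, \|\theta^{\lambda_t} - \theta_t\|_1 \\
& \le & -\frac{\varepsilon}{8} + 2R \cdot \frac{\varepsilon}{32R} \le 0
\end{array}
$$

where the first inequality follows from convexity of $f_\theta$, whose gradient at $\lambda_t$ is $\theta - \theta^{\lambda_t}$. Thus, $\lambda^*$ satisfies the separating constraint put in 
Step (3(c)iiB) of the algorithm. Therefore, it must be contained in the final ellipsoid $E_T$. Let $\lambda_T$ be the center 
of the ellipsoid $E_T$. Let $\zeta_T$ be computed such that $|\zeta_T - f_\theta(\lambda_T)| \le \varepsilon/8$. Then

$$\zeta_T \le f_\theta(\lambda_T) + \frac{\varepsilon}{8} \le f_\theta(\lambda^*) + 2\sqrt{m}\, \|\lambda^* - \lambda_T\| + \frac{\varepsilon}{8}$$

where we use Lemma 6.2. Therefore, the algorithm must have returned $\lambda^\circ = \lambda_T$ as the feasible solution for 
guess $\zeta$, a contradiction. This completes the proof of Lemma 6.3. ■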

7 Counting via Optimization 

In this section we present the proof of Theorem 2.11. We start by phrasing the problem of estimating $|\mathcal{M}|$ 
as a convex optimization problem.

7.1 A Convex Program for Counting 

Let g(d) denote the optimum of the max-entropy program of Figure |4]for ^ and a point 6 in P(^#). If 6 
is in the interior of P(^), then strong duality holds for this convex program and 

g(e) = mff e (X), 

A, 

see Lemma 1231 By the concavity of the Shannon entropy, g(-) can be easily seen to be a concave function 
of 6 . Recall that in the setting of Theorem 12.111 we have an access to an approximate max-entropy oracle 
which, given a 6 in the ^-interior of P(^Jtf), a C, > as a guess for g(6), and an e > 0, either asserts that 
g(9) > £ — £ or returns a X such that fg (A) < £ + £. The running time of this oracle is polynomial in m, 1 /e 
and in the number of bits needed to represent £. Using this oracle, we hope to get an estimate on |^#|. The 
starting point of the proof is the observation that the point that maximizes g(d) when 6 is in P(^) is 

a* def ^ i 

the vertex centroid of P{^). 

Lemma 7.1 $\sup_{\theta \in P(\mathcal{M})} g(\theta) \le \ln|\mathcal{M}|$ and $g(\theta^*) = \ln|\mathcal{M}|$.

Proof: For any $\theta$ in the interior of $P(\mathcal{M})$, $g(\theta)$ is the entropy of some probability distribution over the elements of $\mathcal{M}$. A standard fact in information theory is that the maximum entropy of any distribution over a finite set is attained by the uniform distribution. The entropy of the uniform distribution on $\mathcal{M}$ is $\ln|\mathcal{M}|$; hence, $g(\theta)$ is upper bounded by $\ln|\mathcal{M}|$. On the other hand, the uniform distribution over $\mathcal{M}$ has marginals equal to $\theta^*$ and, thus, $g(\theta^*) = \ln|\mathcal{M}|$. ■
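As a small sanity check of Lemma 7.1, the following sketch evaluates both sides on a toy instance. The family (all 2-element subsets of a 4-element ground set) and all function names are assumptions made for illustration and do not come from the paper. It uses the fact that any dual value $f_\theta(\lambda)$ upper bounds $g(\theta)$, so matching the uniform entropy with the dual value at $\lambda = 0$ pins down $g(\theta^*)$.

```python
import math
from itertools import combinations

# Hypothetical toy instance: M = all 2-element subsets of {0, 1, 2, 3}.
m = 4
family = [frozenset(c) for c in combinations(range(m), 2)]

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def marginals(p):
    """theta_e = sum of p_M over the sets M that contain e."""
    return [sum(pM for pM, M in zip(p, family) if e in M) for e in range(m)]

def dual_value(lam, theta):
    """f_theta(lambda) = <lambda, theta> + ln sum_M exp(-lambda(M))."""
    z = sum(math.exp(-sum(lam[e] for e in M)) for M in family)
    return sum(l * t for l, t in zip(lam, theta)) + math.log(z)

uniform = [1.0 / len(family)] * len(family)
theta_star = marginals(uniform)      # vertex centroid of P(M)
log_count = math.log(len(family))

# The uniform distribution is feasible for theta_star and has entropy ln|M| ...
h_uniform = entropy(uniform)
# ... and lambda = 0 is a dual solution of the same value, so g(theta_star) = ln|M|.
f_zero = dual_value([0.0] * m, theta_star)
```

Here `h_uniform` is a lower bound and `f_zero` an upper bound on $g(\theta^*)$; both coincide with `log_count`.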

Thus, if we could find $g(\theta^*)$, then we could estimate $|\mathcal{M}|$. Finding $g(\theta^*)$ is the same as solving the convex program

$$\sup_{\theta \in P(\mathcal{M})} g(\theta).$$

We use the framework of the ellipsoid method to approximately solve this convex program and find a point which gives us a good enough estimate of $g(\theta^*)$. Note that, unlike the results in the previous section, the bounding box here is easily obtained since $\theta \in P(\mathcal{M}) \subseteq [0,1]^m$.

7.2 The Interior of $P(\mathcal{M})$

To check how close a candidate point $\theta$ is to $\theta^*$, we use the max-entropy separation oracle provided to us. The main difficulty we encounter is that the running time of the max-entropy oracle with marginals $\theta$ depends inverse-polynomially on the interiority of the point $\theta$. Note that interiority of $\theta$ is a prerequisite for strong duality to hold and for a succinct representation of the entropy-maximizing probability distribution to exist, as in Lemma 2.3. The reason we assume a max-entropy oracle that works only if $\theta$ is in the inverse-polynomial interior of $P(\mathcal{M})$ is that such an oracle is the best we can hope for algorithmically and, indeed, Theorems 2.6 and 2.8 provide such an oracle. Without this restriction the proof of Theorem 2.11 is simpler, but the theorem itself is less useful, as there may not exist a max-entropy oracle whose running time does not depend on the interiority of $\theta$.

The first issue raised by interiority is that the point we are looking for, $\theta^*$, may not be in the inverse-polynomial interior of $P(\mathcal{M})$. To tackle this, we show that there is a point $\theta'$ in the $\eta$-interior of $P(\mathcal{M})$ for $\eta = \mathrm{poly}(\varepsilon, 1/m)$ such that $g(\theta') \approx g(\theta^*) = \ln|\mathcal{M}|$. Thus, instead of aiming for $\theta^*$, the ellipsoid algorithm aims for $\theta'$.

Lemma 7.2 Given an $\varepsilon > 0$, there exist an $\eta > 0$ and $\theta'$ such that $\theta'$ is in the $\eta$-interior of $P(\mathcal{M})$ and

$$g(\theta') \ge \left(1 - \frac{\varepsilon}{16m}\right) \ln|\mathcal{M}| \ge \ln|\mathcal{M}| - \frac{\varepsilon}{16}.$$

Moreover, $\eta$ is at least a polynomial in $1/m$ and $\varepsilon$.

Before we prove this lemma, we show that there exists some point in the $\mathrm{poly}(1/m)$-interior of $P(\mathcal{M})$. Such a point is then used to show the existence of $\theta'$. Note that we do not need to bound $g(\theta)$ and only use the fact that it is non-negative.

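Lemma 7.3 (Same as Lemma IXIb Let $\mathcal{M} \subseteq \{0,1\}^m$ and $P(\mathcal{M}) = \{x \in \mathbb{R}^m : A^{=} x = b,\ A^{\le} x \le c\}$ be such that all the entries of $A^{\le}, c$ lie in $\frac{1}{k_l}\mathbb{Z}$ and their absolute values are at most $k_u$. Then there exists a $\theta \in P(\mathcal{M})$ such that $\theta$ is in the $\frac{1}{(r+1)\,k_l k_u \sqrt{m}}$-interior of $P(\mathcal{M})$, where $r$ is the dimension of $P(\mathcal{M})$. Thus, if $k_u, k_l = \mathrm{poly}(m)$, then $\theta$ is in the $\mathrm{poly}(1/m)$-interior of $P(\mathcal{M})$.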

Proof: Let $r$ be the dimension of $P(\mathcal{M})$. Then, there exist $r+1$ affinely independent vertices $z_0, \ldots, z_r \in \{0,1\}^m$. We claim that $\theta = \frac{1}{r+1}\sum_{i=0}^{r} z_i$ satisfies the conclusion of the lemma. Let

$$F_i = \{x \in P(\mathcal{M}) : A^{\le}_i x = c_i\}$$

be a facet of $P(\mathcal{M})$, where the inequality constraint is one of $(A^{\le}, c)$. Since the dimension of a facet is one less than that of the polytope, at least one of $z_0, z_1, \ldots, z_r$, say $z_0$, does not lie in $F_i$ and, hence, $A^{\le}_i z_0 < c_i$. Therefore,

$$A^{\le}_i z_0 \le c_i - \frac{1}{k_l}$$

since all coefficients are $1/k_l$-integral. Thus, the distance of $z_0$ from $F_i$ is at least

$$\frac{1}{k_l \|A^{\le}_i\|} \ge \frac{1}{k_l k_u \sqrt{m}}.$$

Hence, the distance of $\theta$ from $F_i$ is at least $\frac{1}{r+1} \cdot \frac{1}{k_l k_u \sqrt{m}}$. Since this argument works for any facet $F_i$, the distance of $\theta$ from every facet of $P(\mathcal{M})$ is at least $\frac{1}{r+1} \cdot \frac{1}{k_l k_u \sqrt{m}}$. ■
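The averaging argument can be checked numerically on a tiny example. The sketch below assumes a hypothetical polytope $P = [0,1]^2$ with integer constraint data (so $k_l = k_u = 1$) and compares the distance of the centroid of three affinely independent vertices to each facet against the lower bound $\frac{1}{(r+1) k_l k_u \sqrt{m}}$ that the argument yields. The instance and all names are illustrative only.

```python
import math

# Hypothetical toy polytope: P = [0, 1]^2 written as A x <= c with integer data,
# so k_l = 1 and k_u = 1 in the notation of Lemma 7.3.
A = [(1, 0), (0, 1), (-1, 0), (0, -1)]
c = [1, 1, 0, 0]
m, r, k_l, k_u = 2, 2, 1, 1

# r + 1 affinely independent 0/1 vertices and their centroid.
vertices = [(0, 0), (1, 0), (0, 1)]
theta = [sum(v[j] for v in vertices) / (r + 1) for j in range(m)]

def distance_to_facet(point, row, rhs):
    """Euclidean distance from `point` to the hyperplane <row, x> = rhs."""
    slack = rhs - sum(a * x for a, x in zip(row, point))
    return slack / math.sqrt(sum(a * a for a in row))

distances = [distance_to_facet(theta, row, rhs) for row, rhs in zip(A, c)]
bound = 1.0 / ((r + 1) * k_l * k_u * math.sqrt(m))
```

In this instance the smallest facet distance is $1/3$, which exceeds the bound $1/(3\sqrt{2})$.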

Henceforth, we assume that $k_l, k_u = \mathrm{poly}(m)$.

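Proof: [Proof of Lemma 7.2] Let $\theta$ be the point in the interior of $P(\mathcal{M})$ as guaranteed by Lemma 7.3. Consider the point

$$\theta' = \left(1 - \frac{\varepsilon}{16m}\right)\theta^* + \frac{\varepsilon}{16m}\,\theta.$$

Since $\theta$ is in the $\mathrm{poly}(1/m, \varepsilon)$-interior of $P(\mathcal{M})$, $\theta'$ must also be in the $\mathrm{poly}(1/m, \varepsilon)$-interior of $P(\mathcal{M})$. On the other hand, since $g(\cdot)$ is a concave and non-negative function of $\theta$, we have that

$$g(\theta') \ge \left(1 - \frac{\varepsilon}{16m}\right) g(\theta^*) \ge g(\theta^*) - \frac{\varepsilon}{16},$$

where we used the fact that $\ln|\mathcal{M}| \le m$. ■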



7.3 A Separation Oracle for Interiority

Our final ingredient is a test for checking whether a point $\theta$ is in the inverse-polynomial interior of $P(\mathcal{M})$. We show that the separation oracle for $P(\mathcal{M})$ can be used to give such a test. We state the result in generality for any polyhedron $P$. For any $\eta > 0$, let

$$P_\eta \stackrel{\text{def}}{=} \{x : y \in P \ \ \forall y \text{ such that } \|x - y\| \le \eta\}$$

denote the set of $\eta$-interior points in $P$.

Lemma 7.4 There exists an algorithm that, given a separation oracle for a polyhedron $P$, a set of maximal linearly independent equalities $(A^{=}, b)$ satisfied by $P$, an $\eta > 0$, and $\theta \in \mathbb{R}^m$, either

1. asserts that $\theta \in P_{\eta/2m}$, i.e., $\theta$ is in the $\eta/2m$-interior of $P$, or,

2. returns $a$ such that $\langle a, y \rangle \le \langle a, \theta \rangle$ for each $y \in P_\eta$, or equivalently, a separating hyperplane which separates the $\eta$-interior of $P$ from $\theta$.

Proof: We use the separation oracle for $P$ on a small collection of points close to $\theta$ to deduce whether $\theta$ is in the interior of $P$. If even one of these points is not in $P$, we use a separating hyperplane for such a point to separate $\theta$ from the interior of $P$. First, we describe the procedure when $P$ is full dimensional. Let $x_0, \ldots, x_m$ form an $\eta$-regular simplex with center $\theta$, i.e.,

$$\frac{1}{m+1}\sum_{i=0}^{m} x_i = \theta \quad \text{and} \quad \|x_i - x_j\| = \eta$$

for each $i \ne j$. Such $x_0, \ldots, x_m$ can be found by starting with a regular simplex and then translating and scaling it. Now, the algorithm applies the separation oracle to each of the $x_i$. Suppose the separation oracle asserts that $x_i \in P$ for each $0 \le i \le m$. In this case, we assert that $\theta \in P_{\eta/2m}$. Observe that since each vertex of the simplex is in $P$, the whole simplex is in $P$. Since the simplex is regular with each edge of length $\eta$ and center $\theta$, there exists a ball of radius $\frac{\eta}{2m}$ centered at $\theta$ which is contained in the simplex and, hence, in $P$. Thus, $\theta$ is in the $\eta/2m$-interior of $P$ as asserted.

Now, suppose that $x_i \notin P$ for some $i$ and let $a$ be the separating hyperplane, i.e., $\langle a, y \rangle \le \langle a, x_i \rangle$ for each $y \in P$. Then consider the constraint $\langle a, y \rangle \le \langle a, \theta \rangle$. We claim that it is satisfied by each $y \in P_\eta$. Let $y \in P_\eta$ and consider

$$y' \stackrel{\text{def}}{=} y - \theta + x_i.$$

Since $\|\theta - x_i\| \le \eta$ and $y \in P_\eta$, we have that $y' \in P$, which implies that $\langle a, y' \rangle \le \langle a, x_i \rangle$. But this implies that

$$\langle a, y \rangle = \langle a, y' - x_i + \theta \rangle = \langle a, y' \rangle - \langle a, x_i \rangle + \langle a, \theta \rangle \le \langle a, \theta \rangle,$$

which gives us the required separating hyperplane.

Now consider the case where $P$ is not full dimensional and let $r$ be the dimension of $P$. Recall that in this case we define the interior of $P$ by restricting our attention to points in the affine space $\{x : A^{=} x = b\}$. We modify the algorithm to choose an $r$-dimensional simplex in this affine space and check whether each of the vertices of the simplex is in $P$. The analysis is identical in this case. ■
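The full-dimensional case of this test can be sketched in a few lines. The code below is only an illustration: the box oracle for $P = [0,1]^m$, the particular simplex construction, and all function names are assumptions and not part of the paper.

```python
import math

def regular_simplex(center, eta):
    """m + 1 points with pairwise distance eta whose centroid is `center`."""
    m = len(center)
    # The unit vectors e_1..e_m together with a * (1, ..., 1) are pairwise sqrt(2) apart.
    a = (1.0 + math.sqrt(m + 1.0)) / m
    pts = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    pts.append([a] * m)
    centroid = [sum(p[j] for p in pts) / (m + 1) for j in range(m)]
    scale = eta / math.sqrt(2.0)
    return [[center[j] + scale * (p[j] - centroid[j]) for j in range(m)]
            for p in pts]

def box_separation_oracle(x):
    """Toy separation oracle for P = [0, 1]^m (an assumption of this sketch).

    Returns None if x is in P, else a vector a with <a, y> <= <a, x> for all y in P.
    """
    for j, xj in enumerate(x):
        if xj > 1.0:
            return [1.0 if i == j else 0.0 for i in range(len(x))]
        if xj < 0.0:
            return [-1.0 if i == j else 0.0 for i in range(len(x))]
    return None

def interiority_test(theta, eta, oracle):
    """Return ('interior', None) or ('cut', a), following the proof of Lemma 7.4."""
    for x in regular_simplex(theta, eta):
        a = oracle(x)
        if a is not None:
            # <a, y> <= <a, theta> then holds for every y in the eta-interior of P.
            return "cut", a
    return "interior", None
```

For example, `interiority_test([0.5, 0.5], 0.1, box_separation_oracle)` reports an interior point, while a center such as `[0.99, 0.5]` with the same `eta` yields a cut, because a simplex vertex leaves the box.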

7.4 The Ellipsoid Algorithm for Theorem 2.11

Now we present the ellipsoid algorithm to approximately solve the convex program $\max_{\theta \in P(\mathcal{M})} g(\theta)$ and prove Theorem 2.11. The starting ellipsoid is a ball of radius $\sqrt{m}$ that contains $[0,1]^m$, which in turn contains $P(\mathcal{M})$. Let us fix an $\varepsilon > 0$ and apply Lemma 7.2 to obtain $\eta$, which is a polynomial in $1/m$ and $\varepsilon$ and guarantees the existence of $\theta'$ in the $\eta$-interior such that $g(\theta') \ge \ln|\mathcal{M}| - \frac{\varepsilon}{16}$. In the range $(0, \ln|\mathcal{M}|]$ we perform a binary search for the highest $\zeta$ such that the set

$$S(\zeta, \eta) = \{\theta \in P_\eta(\mathcal{M}) : g(\theta) \ge \zeta\}$$

is non-empty, where we search for $\zeta$ within an accuracy of $\varepsilon/16$.

Given a guess $\zeta$ for $g(\theta')$, at an iteration $t$ of the ellipsoid algorithm, we use the center $\theta_t$ of the ellipsoid as a guess for $\theta'$. Ideally, we would pass $\theta_t$ to the max-entropy oracle, which would either assert that $g(\theta_t) \ge \zeta - \varepsilon/16$ or return a $\lambda_t$ such that $f_{\theta_t}(\lambda_t) \le \zeta + \varepsilon/16$. In the first case we stop and return $\theta_t$. In the latter case, we continue the search and use the $\lambda_t$ returned by the max-entropy oracle to update the ellipsoid into one with a smaller volume. However, to get the guarantee on the running time, we need to first check that the candidate point $\theta_t$ is in the $\eta$-interior of $P(\mathcal{M})$. Here, we use the separation oracle from Lemma 7.4. We proceed to the max-entropy oracle only if this separation oracle asserts that the point $\theta_t$ is in the $\eta/2m$-interior of $P(\mathcal{M})$. In case this separation oracle outputs a hyperplane separating $\theta_t$ from $P_\eta(\mathcal{M})$, we use this hyperplane to update the ellipsoid. The key technical fact we show is that when $|\zeta - g(\theta')| \le \varepsilon/16$, $\theta'$ is always contained in every ellipsoid. Thus, once the radius of the ellipsoid becomes small enough, we can output its center as a guess for $\theta'$. Since the radius of the final ellipsoid is small and it contains $\theta'$, the following lemma implies that the value of $g(\cdot)$ at the center of the ellipsoid is close enough to $g(\theta')$ and, hence, by Lemma 7.2, to $g(\theta^*)$.

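Lemma 7.5 Let $\theta, \theta' \in P(\mathcal{M})$ be such that $\|\theta - \theta'\| \le \varepsilon$ and $\theta$ is in the $\eta$-interior of $P(\mathcal{M})$. Then

$$g(\theta') \ge \left(1 - \frac{\varepsilon}{\eta}\right) g(\theta).$$

Proof: Let $\lambda^*$ be an optimal solution to $\inf_\lambda f_\theta(\lambda)$. Thus, $p^{\lambda^*}$ is the optimal solution to the primal convex program and $H(p^{\lambda^*}) = \inf_\lambda f_\theta(\lambda)$. We construct a probability distribution $q$ which is feasible for the primal convex program with parameter $\theta'$ and satisfies $H(q) \ge (1 - \varepsilon/\eta) H(p^{\lambda^*})$, thus proving the lemma. We begin with a claim.

Claim 7.6 Let $\theta'' = \frac{\theta' - (1 - \varepsilon/\eta)\theta}{\varepsilon/\eta}$. Then $\theta'' \in P(\mathcal{M})$.

Proof: First, observe that any equality constraint for $P(\mathcal{M})$ of the form $\langle A^{=}_i, x \rangle = b_i$ is satisfied by both $\theta$ and $\theta'$. Therefore,

$$\langle A^{=}_i, \theta'' \rangle = \frac{\langle A^{=}_i, \theta' \rangle - (1 - \varepsilon/\eta)\langle A^{=}_i, \theta \rangle}{\varepsilon/\eta} = \frac{b_i - (1 - \varepsilon/\eta) b_i}{\varepsilon/\eta} = b_i.$$

Thus, since $\theta$ is in the $\eta$-interior, it is enough to show that $\|\theta'' - \theta\| \le \eta$. To see this note that

$$\|\theta'' - \theta\| = \left\| \frac{\theta' - (1 - \varepsilon/\eta)\theta}{\varepsilon/\eta} - \theta \right\| = \frac{\|\theta' - \theta\|}{\varepsilon/\eta} \le \frac{\varepsilon}{\varepsilon/\eta} = \eta,$$

where the last inequality follows from the fact that $\|\theta - \theta'\| \le \varepsilon$.

Let $q''$ be an arbitrary probability measure over $\mathcal{M}$ such that the marginals of $q''$ equal $\theta''$, i.e., $\theta''_e = \sum_{M \in \mathcal{M}: e \in M} q''_M$. Let $q$ be the probability measure defined to be

$$q = \left(1 - \frac{\varepsilon}{\eta}\right) p^{\lambda^*} + \frac{\varepsilon}{\eta}\, q''.$$

Then

$$\sum_{M \in \mathcal{M}: e \in M} q_M = \left(1 - \frac{\varepsilon}{\eta}\right)\theta_e + \frac{\varepsilon}{\eta}\,\theta''_e = \theta'_e.$$

By concavity and non-negativity of the entropy function, we have

$$H(q) \ge \left(1 - \frac{\varepsilon}{\eta}\right) H(p^{\lambda^*}) + \frac{\varepsilon}{\eta} H(q'') \ge \left(1 - \frac{\varepsilon}{\eta}\right) H(p^{\lambda^*})$$

as required. ■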

We now move on to the description of the ellipsoid algorithm and subsequently complete the proof of Theorem 2.11.

1. Input

(a) An error parameter $\varepsilon > 0$.

(b) A maximally linearly independent set of equalities $(A^{=}, b)$ for $P(\mathcal{M})$.

(c) A separation oracle for the facets $(A^{\le}, c)$ of $P(\mathcal{M})$.

(d) A max-entropy oracle for $\mathcal{M}$.

(e) A guess $\zeta \in (0, m]$ (for $g(\theta')$).

2. Initialization

(a) Let $\eta$ be as in Lemma 7.2.

(b) Let $E_0 = E(B_0, c_0)$ be a sphere with radius $R = \sqrt{m}$ containing $[0,1]^m$, restricted to the affine space $A^{=} x = b$.

(c) Set $t = 0$.

3. Repeat until the ellipsoid $E_t$ is contained in a ball of radius at most $\frac{\varepsilon\eta}{16m}$.

(a) Given the ellipsoid $E_t = E(B_t, c_t)$, set $\theta_t = c_t$.

(b) Check using the separation oracle for $P(\mathcal{M})$ as in Lemma 7.4 if $\theta_t \in P_{\eta/2m}(\mathcal{M})$

i. then goto Step (3c)

ii. else let $e_t$ be the separating hyperplane returned as in Lemma 7.4, i.e., $\langle e_t, \theta - \theta_t \rangle \ge 0$ for all $\theta \in P_\eta(\mathcal{M})$, and goto Step (3e).

(c) Call the max-entropy oracle with input $\theta_t$, $\zeta$ and $\varepsilon/16$.

(d) If $g(\theta_t) \ge \zeta - \varepsilon/16$

i. then return $\theta_t$ and stop.

ii. else the max-entropy oracle returns $\lambda_t$ such that $f_{\theta_t}(\lambda_t) \le \zeta + \varepsilon/16$. Let $e_t = \lambda_t$.

(e) Compute the ellipsoid $E_{t+1}$ to be the smallest ellipsoid containing the half-ellipsoid $\{\theta \in E_t : \langle e_t, \theta - \theta_t \rangle \ge 0\}$, restricted to the affine space $A^{=} x = b$.

(f) $t = t + 1$.

4. Let $T = t$ and call the max-entropy oracle with input $\theta_T$, $\zeta$ and $\varepsilon/16$.

5. If $g(\theta_T) \ge \zeta - \varepsilon/16$

(a) then return $\theta_T$ and stop.

(b) else return $g(\theta') < \zeta$ and stop ($\zeta$ is not a good guess for $g(\theta^*)$).
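The ellipsoid update in Step 3(e) can be illustrated with the standard central-cut formulas. The sketch below covers only the full-dimensional case in dimension $n \ge 2$ and ignores the restriction to the affine space $A^{=} x = b$. The ellipsoid representation $\{x : (x - c)^\top B^{-1} (x - c) \le 1\}$ and the function name are assumptions of this example rather than the paper's notation.

```python
import math

def central_cut_update(center, shape, e):
    """One central-cut ellipsoid step in dimension n >= 2.

    The ellipsoid is {x : (x - center)^T shape^{-1} (x - center) <= 1}; the
    returned ellipsoid contains its half {x : <e, x - center> >= 0}.
    """
    n = len(center)
    be = [sum(shape[i][j] * e[j] for j in range(n)) for i in range(n)]  # shape @ e
    norm = math.sqrt(sum(e[i] * be[i] for i in range(n)))               # sqrt(e^T shape e)
    b = [v / norm for v in be]
    # Move the center into the kept half-space and shrink along direction b.
    new_center = [center[i] + b[i] / (n + 1) for i in range(n)]
    factor = n * n / (n * n - 1.0)
    new_shape = [[factor * (shape[i][j] - 2.0 / (n + 1) * b[i] * b[j])
                  for j in range(n)] for i in range(n)]
    return new_center, new_shape
```

Starting from the unit disk in the plane and keeping the half with first coordinate at least zero, the update moves the center to $(1/3, 0)$ and produces the shape matrix $\mathrm{diag}(4/9, 4/3)$, whose boundary passes through $(1, 0)$ and $(0, \pm 1)$.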

It is clear that any call to the approximate optimization oracle is made for points $\theta$ which are in the $\frac{\eta}{2m}$-interior. Thus, the running time of the algorithm is polynomially bounded in $m$ and $1/\varepsilon$ for each call. To bound the number of iterations, note that the starting ellipsoid has radius $\sqrt{m}$ and the final ellipsoid has radius $\mathrm{poly}(1/m, \varepsilon)$. Hence, by Theorem 2.14, the number of iterations can be bounded by $\mathrm{poly}(m, 1/\varepsilon)$. It remains to prove the correctness of the algorithm.

Towards this, let $\zeta^\circ$ be the largest guess of $\zeta$ for which the algorithm returns a positive answer and let $\theta^\circ$ be the point returned by the algorithm for guess $\zeta^\circ$. We return $Z^\circ = e^{\zeta^\circ}$ as our estimate of $|\mathcal{M}|$. To complete the proof of Theorem 2.11 we show that $Z^\circ$ satisfies

$$(1 - \varepsilon)|\mathcal{M}| \le Z^\circ \le (1 + \varepsilon)|\mathcal{M}|. \qquad (5)$$

First, we prove the following lemma.

Lemma 7.7 Consider the run of the ellipsoid algorithm for a guess $\zeta$ and let the hyperplane $\{\theta : \langle e_t, \theta - \theta_t \rangle \ge 0\}$ be used as a separating hyperplane in some iteration of the algorithm. Then this separating hyperplane does not cut off any point $\theta$ such that $\theta \in P_\eta(\mathcal{M})$ and $g(\theta) \ge \zeta + \varepsilon/16$.

Proof: If the hyperplane $e_t$ is obtained in Step (3(b)ii), then it is clearly a valid inequality for $P_\eta(\mathcal{M})$ and therefore does not cut off any of its points. Otherwise, suppose $e_t = \lambda_t$ is obtained in Step (3(d)ii). Then

$$f_{\theta_t}(\lambda_t) = \langle \lambda_t, \theta_t \rangle + \ln Z^{\lambda_t} \le \zeta + \frac{\varepsilon}{16}.$$

Hence,

$$\langle \lambda_t, \theta - \theta_t \rangle = f_\theta(\lambda_t) - f_{\theta_t}(\lambda_t) \ge f_\theta(\lambda_t) - \zeta - \frac{\varepsilon}{16}. \qquad (6)$$

Thus, by the assumption in the lemma,

$$f_\theta(\lambda_t) \ge g(\theta) \ge \zeta + \frac{\varepsilon}{16}$$

and, therefore, by (6), $\theta$ satisfies the constraint $\langle e_t, \theta - \theta_t \rangle \ge 0$. ■
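The identity behind (6), namely that $f_\theta(\lambda) - f_{\theta_t}(\lambda) = \langle \lambda, \theta - \theta_t \rangle$ because the $\ln Z^{\lambda}$ term does not depend on the marginals, can be confirmed numerically. The toy family and the specific vectors below are arbitrary choices made for illustration.

```python
import math
from itertools import combinations

m = 3
family = [frozenset(c) for c in combinations(range(m), 2)]  # hypothetical toy M

def f(theta, lam):
    """f_theta(lambda) = <lambda, theta> + ln sum_M exp(-lambda(M))."""
    z = sum(math.exp(-sum(lam[e] for e in M)) for M in family)
    return sum(l * t for l, t in zip(lam, theta)) + math.log(z)

theta_t = [0.9, 0.6, 0.5]   # plays the role of the current ellipsoid center
theta = [0.7, 0.7, 0.6]     # some other candidate point
lam_t = [0.3, -1.2, 0.8]    # an arbitrary dual vector standing in for lambda_t

lhs = sum(l * (a - b) for l, a, b in zip(lam_t, theta, theta_t))
rhs = f(theta, lam_t) - f(theta_t, lam_t)
```

The two quantities `lhs` and `rhs` agree up to floating-point error, which is exactly the equality used in (6).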

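We now show that $\zeta^\circ \ge \ln|\mathcal{M}| - \frac{\varepsilon}{4}$. Consider the run of the algorithm for a guess

$$\zeta' \in \left[\ln|\mathcal{M}| - \frac{4\varepsilon}{16},\ \ln|\mathcal{M}| - \frac{3\varepsilon}{16}\right].$$

Since

$$g(\theta') \ge \ln|\mathcal{M}| - \frac{\varepsilon}{16} > \zeta' + \frac{\varepsilon}{16},$$

$\theta'$ cannot be cut off in any iteration by Lemma 7.7. If the ellipsoid algorithm returns an answer when run with guess $\zeta'$, then

$$\zeta^\circ \ge \zeta' \ge \ln|\mathcal{M}| - \frac{4\varepsilon}{16}$$

as claimed. Otherwise, we end with an ellipsoid $E_T$ of radius at most $\frac{\varepsilon\eta}{16m}$. Let $\theta_T$ be the center of the ellipsoid $E_T$. Since $\theta' \in E_T$, we have that $\|\theta' - \theta_T\| \le \frac{\varepsilon\eta}{16m}$. Since $\theta'$ is in the $\eta$-interior, from Lemma 7.5 it follows that

$$g(\theta_T) \ge \left(1 - \frac{\varepsilon\eta/16m}{\eta}\right) g(\theta') = \left(1 - \frac{\varepsilon}{16m}\right) g(\theta') \ge \ln|\mathcal{M}| - \frac{2\varepsilon}{16}.$$

This contradicts the fact that the algorithm did not output $\theta_T$ in the last iteration and asserted

$$g(\theta_T) < \zeta' + \frac{\varepsilon}{16} \le \ln|\mathcal{M}| - \frac{2\varepsilon}{16}.$$

Since $\zeta^\circ \ge \ln|\mathcal{M}| - \frac{4\varepsilon}{16}$ and $\zeta^\circ \le \ln|\mathcal{M}| + \frac{\varepsilon}{16}$, we obtain that $(1 - \varepsilon)|\mathcal{M}| \le Z^\circ \le (1 + \varepsilon)|\mathcal{M}|$, proving (5) and completing the proof of Theorem 2.11.

A Omitted Proofs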

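A.1 Duality of the Max-Entropy Program

Lemma A.1 For a point $\theta$ in the interior of $P(\mathcal{M})$, there exists a unique distribution $p^*$ which attains the max-entropy while satisfying

$$\sum_{M \in \mathcal{M}} p^*_M 1_M = \theta.$$

Moreover, there exists $\lambda^* : [m] \to \mathbb{R}$ such that $p^*_M \propto e^{-\lambda^*(M)}$ for each $M \in \mathcal{M}$.

Proof: Consider the convex program for computing the maximum-entropy distribution with marginals $\theta$ as in Figure 4. We first prove that the dual of this convex program is the one given in Figure 5.

$$\begin{array}{lll} \sup & \sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M} & \\ \text{s.t.} & & \\ \forall e \in [m] & \sum_{M \in \mathcal{M},\, M \ni e} p_M = \theta_e & (7) \\ & \sum_{M \in \mathcal{M}} p_M = 1 & (8) \\ \forall M \in \mathcal{M} & p_M \ge 0 & (9) \end{array}$$

Figure 4: Max-Entropy Program for $(\mathcal{M}, \theta)$

$$\begin{array}{lll} \inf & f_\theta(\lambda) = \sum_{e \in [m]} \theta_e \lambda_e + \ln \sum_{M \in \mathcal{M}} e^{-\lambda(M)} & \\ \text{s.t.} & & \\ \forall e \in [m] & \lambda_e \in \mathbb{R} & (10) \end{array}$$

Figure 5: Dual of the Max-Entropy Program for $(\mathcal{M}, \theta)$

To see this, consider multipliers $\lambda_e$ for constraints (7) in Figure 4 and a multiplier $z$ for the constraint (8). Then the Lagrangian $L(p, \lambda, z)$ is defined to be

$$\sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M} + \sum_{e \in [m]} \lambda_e \Big(\theta_e - \sum_{M \in \mathcal{M},\, M \ni e} p_M\Big) + z \Big(1 - \sum_{M \in \mathcal{M}} p_M\Big).$$

This is the same as

$$\sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M} - \sum_{M \in \mathcal{M}} p_M \lambda(M) - z \sum_{M \in \mathcal{M}} p_M + \sum_{e \in [m]} \lambda_e \theta_e + z. \qquad (11)$$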

Let $g(\lambda, z) \stackrel{\text{def}}{=} \sup_{p \ge 0} L(p, \lambda, z)$. The $p$ which achieves $g(\lambda, z)$ can be obtained by taking partial derivatives with respect to $p_M$ and setting them to $0$ as follows.

$$\forall M \in \mathcal{M} \quad \frac{\partial L}{\partial p_M} = \ln \frac{1}{p_M} - 1 - \lambda(M) - z = 0. \qquad (12)$$

Thus, $p_M = e^{-1-z-\lambda(M)}$ for all $M \in \mathcal{M}$. Summing this up over all $M \in \mathcal{M}$ we obtain that

$$\sum_{M \in \mathcal{M}} p_M = e^{-1-z} \sum_{M \in \mathcal{M}} e^{-\lambda(M)}. \qquad (13)$$

For such a $(p, \lambda, z)$, if we multiply each equation (12) by $p_M$ and add all of them up we obtain

$$\sum_{M \in \mathcal{M}} \Big(p_M \ln \frac{1}{p_M} - p_M - p_M \lambda(M) - z p_M\Big) = 0,$$

which implies that

$$\sum_{M \in \mathcal{M}} \Big(p_M \ln \frac{1}{p_M} - p_M \lambda(M) - z p_M\Big) = \sum_{M \in \mathcal{M}} p_M.$$

Hence, combining this with (11) and using (13), the dual becomes to find the infimum of $g(\lambda, z)$, which is

$$\sum_{e \in [m]} \lambda_e \theta_e + z + e^{-1-z} \sum_{M \in \mathcal{M}} e^{-\lambda(M)}.$$

Optimizing $g(\lambda, z)$ over $z$, one obtains that $g(\lambda, z)$ is minimized when

$$1 - e^{-1-z} \sum_{M \in \mathcal{M}} e^{-\lambda(M)} = 0.$$

Hence, $z = \ln \sum_{M \in \mathcal{M}} e^{-\lambda(M)} - 1$. Thus, the Lagrangian dual becomes to minimize

$$\sum_{e \in [m]} \theta_e \lambda_e + \ln \sum_{M \in \mathcal{M}} e^{-\lambda(M)}.$$

This completes the proof that the dual of the convex program in Figure 4 is the convex program in Figure 5.

Since $\theta$ is in the interior of $P(\mathcal{M})$, the primal-dual pair satisfies Slater's condition and strong duality holds, see H, implying that the optimum of both the programs is the same. Moreover, by the strict concavity of the entropy function, the optimum is unique. Hence, at optimality, $p^*_M = \frac{e^{-\lambda^*(M)}}{\sum_{N \in \mathcal{M}} e^{-\lambda^*(N)}}$ where $\lambda^*$ is the optimal dual solution and $p^*$ is the optimal primal solution. ■
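To make the primal-dual relationship concrete, the sketch below minimizes the dual objective $f_\theta(\lambda)$ by plain gradient descent on a toy instance and checks that the resulting Gibbs distribution has the prescribed marginals and that its entropy matches the dual value. This is only an illustration and not the paper's algorithm: the family, the marginal vector, the fixed step size and the iteration count are all assumptions chosen for this example.

```python
import math
from itertools import combinations

m = 3
family = [frozenset(c) for c in combinations(range(m), 2)]  # hypothetical toy M
theta = [0.9, 0.6, 0.5]  # a point in the interior of P(M); entries sum to 2

def gibbs(lam):
    """Return (p, Z) with p_M proportional to exp(-lambda(M))."""
    w = [math.exp(-sum(lam[e] for e in M)) for M in family]
    z = sum(w)
    return [x / z for x in w], z

def marginals(p):
    """theta_e = sum of p_M over the sets M that contain e."""
    return [sum(pM for pM, M in zip(p, family) if e in M) for e in range(m)]

lam = [0.0] * m
for _ in range(3000):
    p, _z = gibbs(lam)
    grad = [t - q for t, q in zip(theta, marginals(p))]  # gradient of f_theta at lam
    lam = [l - 0.5 * g for l, g in zip(lam, grad)]       # fixed step size (assumed)

p_star, z_star = gibbs(lam)
dual_opt = sum(l * t for l, t in zip(lam, theta)) + math.log(z_star)
primal_opt = -sum(x * math.log(x) for x in p_star)
```

In this instance the marginal constraints already determine the distribution, so `p_star` approaches $(0.5, 0.4, 0.1)$ on the sets $\{0,1\}, \{0,2\}, \{1,2\}$, and `dual_opt` agrees with `primal_opt` as strong duality predicts.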

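A.2 Optimal and Near-Optimal Dual Solutions

In this section we first prove that if $\lambda$ is a solution to the program in Figure 5 for $(\mathcal{M}, \theta)$ of value $\zeta$, then so is $\lambda + (A^{=})^\top d$ for any $d$. Recall that $(A^{=}, b)$ are the equality constraints satisfied by all vertices of $\mathcal{M}$. Hence, in our search for the optimal solution to the dual convex program, we restrict ourselves to the space of $\lambda$ such that $A^{=}\lambda = 0$.

Lemma A.2 (Same as Lemma 123b $f_\theta(\lambda) = f_\theta(\lambda + (A^{=})^\top d)$ for any $d$.

Proof: First, note that $\langle \lambda + (A^{=})^\top d, \theta \rangle = \langle \lambda, \theta \rangle + \langle (A^{=})^\top d, \theta \rangle$. Note that $\theta$ can be written as $\sum_{M \in \mathcal{M}} p_M 1_M$ and $A^{=} 1_M = b$ for all $M \in \mathcal{M}$. Hence,

$$\langle \lambda + (A^{=})^\top d, \theta \rangle = \langle \lambda, \theta \rangle + \sum_{M \in \mathcal{M}} p_M \langle d, b \rangle = \langle \lambda, \theta \rangle + \langle d, b \rangle$$

since $\sum_{M \in \mathcal{M}} p_M = 1$. On the other hand note that

$$\ln \sum_{M \in \mathcal{M}} e^{-\langle \lambda + (A^{=})^\top d,\, 1_M \rangle} = \ln \Big( e^{-\langle d, b \rangle} \sum_{M \in \mathcal{M}} e^{-\langle \lambda, 1_M \rangle} \Big),$$

which equals

$$-\langle d, b \rangle + \ln \sum_{M \in \mathcal{M}} e^{-\langle \lambda, 1_M \rangle} = -\langle d, b \rangle + \ln \sum_{M \in \mathcal{M}} e^{-\lambda(M)}.$$

Combining, we obtain that $f_\theta(\lambda + (A^{=})^\top d)$ equals

$$\langle \lambda + (A^{=})^\top d, \theta \rangle + \ln \sum_{M \in \mathcal{M}} e^{-\langle \lambda + (A^{=})^\top d,\, 1_M \rangle} = \langle \lambda, \theta \rangle + \ln \sum_{M \in \mathcal{M}} e^{-\lambda(M)},$$

which equals $f_\theta(\lambda)$. This completes the proof of the lemma. ■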
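The invariance in Lemma A.2 is easy to observe numerically. In the hypothetical instance below every set has exactly two elements, so the only equality constraint is $\sum_e x_e = 2$, i.e. $A^{=} = (1, 1, 1)$ and $b = 2$. The specific vectors and the shift `d` are arbitrary.

```python
import math
from itertools import combinations

m = 3
family = [frozenset(c) for c in combinations(range(m), 2)]  # hypothetical toy M
A_eq = [1.0, 1.0, 1.0]  # every M has two elements, so <A_eq, 1_M> = 2 = b

def f(theta, lam):
    """f_theta(lambda) = <lambda, theta> + ln sum_M exp(-lambda(M))."""
    z = sum(math.exp(-sum(lam[e] for e in M)) for M in family)
    return sum(l * t for l, t in zip(lam, theta)) + math.log(z)

theta = [0.9, 0.6, 0.5]   # satisfies the equality: entries sum to 2
lam = [0.3, -1.2, 0.8]
d = 2.5                   # arbitrary multiplier for the single equality
shifted = [l + d * a for l, a in zip(lam, A_eq)]

gap = abs(f(theta, lam) - f(theta, shifted))
```

The linear term grows by $\langle d, b \rangle = 5$ while the log-partition term shrinks by the same amount, so `gap` is zero up to rounding.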

Thus, we can assume that $A^{=}\lambda^* = 0$ where $\lambda^*$ is the optimal solution for the program of Figure 5 for $(\mathcal{M}, \theta)$. Next we prove that if $\lambda$ is such that $f_\theta(\lambda)$ is close to $f_\theta(\lambda^*)$, then $p^\lambda$ and $p^{\lambda^*}$ are close to each other. We relate the Kullback-Leibler distance between $p^\lambda$ and $p^{\lambda^*}$ to $f_\theta(\lambda) - f_\theta(\lambda^*)$. In particular, $\theta^\lambda$ and $\theta$ are close to each other. Before we state this lemma, we recall some basic measures of proximity between probability distributions.

Definition A.3 Let $p, q$ be two probability distributions over the same space $\Omega$. The following are natural measures of distance.

1. $\|p - q\|_{TV} \stackrel{\text{def}}{=} \max_{S \subseteq \Omega} |p(S) - q(S)|$.

2. $\|p - q\|_1 \stackrel{\text{def}}{=} \sum_{\omega \in \Omega} |p(\omega) - q(\omega)|$.

3. If $p, q > 0$, then the Kullback-Leibler distance between them is defined to be

$$D_{KL}(p \| q) \stackrel{\text{def}}{=} \sum_{\omega \in \Omega} p(\omega) \ln \frac{p(\omega)}{q(\omega)}.$$

This distance function is always non-negative but not necessarily symmetric.

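The following lemma shows a close relation between the dual solutions and the Kullback-Leibler distance between the corresponding primal distributions.

Lemma A.4 Suppose $\lambda$ is such that $f_\theta(\lambda) \le f_\theta(\lambda^*) + \varepsilon$ where $\lambda^*$ is the optimum dual solution for the instance $(\mathcal{M}, \theta)$. Let $p^\lambda, p^{\lambda^*}$ be the probability distributions corresponding to $\lambda$ and $\lambda^*$ respectively:

$$p^\lambda_M \stackrel{\text{def}}{=} \frac{e^{-\lambda(M)}}{\sum_{N \in \mathcal{M}} e^{-\lambda(N)}} \quad \text{and} \quad p^{\lambda^*}_M \stackrel{\text{def}}{=} \frac{e^{-\lambda^*(M)}}{\sum_{N \in \mathcal{M}} e^{-\lambda^*(N)}}.$$

Then

$$f_\theta(\lambda) - f_\theta(\lambda^*) = D_{KL}(p^{\lambda^*} \| p^\lambda) \le \varepsilon.$$

Proof: Let $Z^\lambda, Z^{\lambda^*}$ denote $\sum_{M \in \mathcal{M}} e^{-\lambda(M)}$ and $\sum_{M \in \mathcal{M}} e^{-\lambda^*(M)}$ respectively. Then, it follows from the optimality of $\lambda^*$ that

$$\theta_e = \frac{\sum_{M \in \mathcal{M},\, M \ni e} e^{-\lambda^*(M)}}{Z^{\lambda^*}}. \qquad (14)$$

Hence,

$$D_{KL}(p^{\lambda^*} \| p^\lambda) = \sum_{M \in \mathcal{M}} p^{\lambda^*}_M \ln \frac{1}{p^\lambda_M} - \sum_{M \in \mathcal{M}} p^{\lambda^*}_M \ln \frac{1}{p^{\lambda^*}_M},$$

which, since $p^\lambda_M = e^{-\lambda(M)}/Z^\lambda$, is

$$\sum_{M \in \mathcal{M}} p^{\lambda^*}_M \ln Z^\lambda + \sum_{M \in \mathcal{M}} p^{\lambda^*}_M \lambda(M) - \sum_{M \in \mathcal{M}} p^{\lambda^*}_M \ln \frac{1}{p^{\lambda^*}_M}.$$

This is equal to

$$\ln Z^\lambda - f_\theta(\lambda^*) + \sum_{e \in [m]} \lambda_e \sum_{M \in \mathcal{M},\, M \ni e} p^{\lambda^*}_M = \ln Z^\lambda - f_\theta(\lambda^*) + \langle \lambda, \theta \rangle = f_\theta(\lambda) - f_\theta(\lambda^*).$$

Here, we have used (14) and the fact that, by strong duality, the entropy of $p^{\lambda^*}$ equals $f_\theta(\lambda^*)$. Hence, $D_{KL}(p^{\lambda^*} \| p^\lambda) = f_\theta(\lambda) - f_\theta(\lambda^*) \le \varepsilon$. ■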
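The identity of Lemma A.4 can be checked on a small example. To avoid solving an optimization problem, the sketch below fixes an arbitrary $\lambda^*$ first and then defines $\theta$ as the marginals of $p^{\lambda^*}$, which makes $\lambda^*$ optimal for that $\theta$ since the gradient of $f_\theta$ vanishes there. The instance and the numbers are assumptions for illustration. As a side check it also evaluates Pinsker's inequality with total variation computed as half the $\ell_1$ distance.

```python
import math
from itertools import combinations

m = 3
family = [frozenset(c) for c in combinations(range(m), 2)]  # hypothetical toy M

def gibbs(lam):
    """Return p with p_M proportional to exp(-lambda(M))."""
    w = [math.exp(-sum(lam[e] for e in M)) for M in family]
    z = sum(w)
    return [x / z for x in w]

def f(theta, lam):
    """f_theta(lambda) = <lambda, theta> + ln sum_M exp(-lambda(M))."""
    z = sum(math.exp(-sum(lam[e] for e in M)) for M in family)
    return sum(l * t for l, t in zip(lam, theta)) + math.log(z)

lam_star = [0.4, -0.3, 1.1]
p_star = gibbs(lam_star)
# Choosing theta as the marginals of p_star makes lam_star dual-optimal for theta.
theta = [sum(pM for pM, M in zip(p_star, family) if e in M) for e in range(m)]

lam = [0.1, 0.2, -0.5]
p_lam = gibbs(lam)

kl = sum(a * math.log(a / b) for a, b in zip(p_star, p_lam))
gap = f(theta, lam) - f(theta, lam_star)
tv = 0.5 * sum(abs(a - b) for a, b in zip(p_star, p_lam))
```

Here `kl` and `gap` coincide, and `tv` is at most `math.sqrt(kl / 2)`.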

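It is well-known, see (H Lemma 12.6.1, pp. 300-301), that for probability distributions $p, q$ over the same sample space

$$\|p - q\|_{TV} \le O\Big(\sqrt{D_{KL}(p \| q)}\Big). \qquad (15)$$

Hence, we obtain the following as a corollary to Lemma A.4.

Corollary A.5 Let $\lambda$ be such that $f_\theta(\lambda) \le f_\theta(\lambda^*) + \varepsilon$. Then, for all $e \in [m]$, $|\theta^\lambda_e - \theta^{\lambda^*}_e| \le O(\sqrt{\varepsilon})$.

Proof: This follows from the fact that

$$|\theta^\lambda_e - \theta^{\lambda^*}_e| \le \|p^\lambda - p^{\lambda^*}\|_{TV},$$

which is at most

$$O\Big(\sqrt{D_{KL}(p^{\lambda^*} \| p^\lambda)}\Big) = O\Big(\sqrt{f_\theta(\lambda) - f_\theta(\lambda^*)}\Big) = O(\sqrt{\varepsilon}).$$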



B Generalized Counting and Minimizing Kullback-Leibler Divergence 

In this section, we outline how to obtain algorithms for generalized approximate counting from max-entropy oracles. Recall that the generalized approximate counting problem is: Given $\varepsilon > 0$ and weights $\mu \in \mathbb{R}^m$, output $\tilde{Z}^\mu$ and $\tilde{Z}^\mu_e$ for each $e \in [m]$ such that the following guarantees hold.

1. $(1 - \varepsilon) Z^\mu \le \tilde{Z}^\mu \le (1 + \varepsilon) Z^\mu$ and

2. for every $e \in [m]$, $(1 - \varepsilon) Z^\mu_e \le \tilde{Z}^\mu_e \le (1 + \varepsilon) Z^\mu_e$.

Here $Z^\mu$ is defined to be $\sum_{M \in \mathcal{M}} e^{-\mu(M)}$. The running time should be a polynomial in $m$, $1/\varepsilon$, $\log 1/\alpha$ and the number of bits required to represent $e^{-\mu_e}$ for any $e \in [m]$, or $\|\mu\|_1$.[10] Towards constructing such algorithms, we need access to oracles that solve a more general problem than the max-entropy problem used in Theorem 2.11, namely, the min-Kullback-Leibler (KL)-divergence problem. This raises the issue that, while Theorems 2.6 and 2.8 output solutions to the max-entropy problem given access to generalized counting oracles, to obtain a generalized counting oracle we need access to a minimum KL-divergence oracle. However, later in this section we show that given a generalized counting oracle, we can not only solve the max-entropy convex programs as in Theorem 2.8 but also the min-KL-divergence program in a straightforward manner. The convex program for the min-KL-divergence problem is given in Figure 6. Given a $\mu \in \mathbb{R}^m$, recall that $p^\mu$ is the product distribution $p^\mu_M = \frac{e^{-\mu(M)}}{Z^\mu}$. Observe that the objective is to find a distribution $p$ that minimizes the KL-divergence, up to a shift, between the distributions $p$ and $p^\mu$. This follows since the objective can be rewritten as

$$\sum_{M \in \mathcal{M}} p_M \ln \frac{p^\mu_M}{p_M} + \ln Z^\mu = \ln Z^\mu - D_{KL}(p \| p^\mu)$$

where $Z^\mu$ does not depend on the variable $p$ but only on the input $\mu$. The dual of this convex program is given in Figure 7. We use the following to denote the objective function of the dual:

$$f^\mu_\theta(\lambda) \stackrel{\text{def}}{=} \sum_{e \in [m]} \theta_e \lambda_e + \ln \sum_{M \in \mathcal{M}} e^{-\lambda(M) - \mu(M)}.$$

When $\theta$ is in the interior of $P(\mathcal{M})$, strong duality holds between the programs of Figures 6 and 7. We assume that we are given the following approximate oracle to solve the above set of convex programs.

$$\begin{array}{ll} \sup & \sum_{M \in \mathcal{M}} p_M \ln \frac{p^\mu_M}{p_M} + \ln Z^\mu \\ \text{s.t.} & \\ \forall e \in [m] & \sum_{M \in \mathcal{M},\, M \ni e} p_M = \theta_e \\ & \sum_{M \in \mathcal{M}} p_M = 1 \\ \forall M \in \mathcal{M} & p_M \ge 0 \end{array}$$

Figure 6: Min-KL-Divergence Program for $(\mathcal{M}, \theta, \mu)$

$$\begin{array}{ll} \inf & \sum_{e \in [m]} \theta_e \lambda_e + \ln \sum_{M \in \mathcal{M}} p^\mu_M e^{-\lambda(M)} + \ln Z^\mu \\ \text{s.t.} & \\ \forall e \in [m] & \lambda_e \in \mathbb{R} \end{array}$$

Figure 7: Dual of the Min-KL-Divergence Program for $(\mathcal{M}, \theta, \mu)$

[10] The number of bits needed to represent $e^{-\mu_e}$ for any $e \in [m]$, up to an additive error of $2^{-m}$, is $\max\{\mathrm{poly}(m), \|\mu\|_1\}$ for some $\mathrm{poly}(m)$. Since all our running times depend on $\mathrm{poly}(m)$, we only track the dependence on $\|\mu\|_1$.

Definition B.1 An approximate KL-optimization oracle for $\mathcal{M}$, given a $\theta$ in the $\eta$-interior of $P(\mathcal{M})$, $\mu \in \mathbb{R}^m$, a $\zeta > 0$, and an $\varepsilon > 0$, either

1. asserts that $\inf_\lambda f^\mu_\theta(\lambda) \ge \zeta - \varepsilon$ or

2. returns a $\lambda \in \mathbb{R}^m$ such that $f^\mu_\theta(\lambda) \le \zeta + \varepsilon$.

The oracle is assumed to be efficient, i.e., it runs in time polynomial in $m$, $1/\varepsilon$, $1/\eta$, the number of bits needed to represent $\zeta$ and $\|\mu\|_1$.

The following theorem is the appropriate generalization of Theorem 2.11 in this setting.

Theorem B.2 There exists an algorithm that, given a maximal set of linearly independent equalities $(A^{=}, b)$ and a separation oracle for $P(\mathcal{M})$, a $\mu \in \mathbb{R}^m$, and an approximate KL-optimization oracle for $\mathcal{M}$ as above, returns a $\tilde{Z}$ such that $(1 - \varepsilon) Z^\mu \le \tilde{Z} \le (1 + \varepsilon) Z^\mu$. Assuming that the running times of the separation oracle and the approximate KL oracle are polynomial in their respective input parameters, the running time of the algorithm is bounded by a polynomial in $m$, $1/\varepsilon$, the number of bits needed to represent $(A^{=}, b)$ and $\|\mu\|_1$.

We omit the proof since it is a simple but tedious generalization of the proof of Theorem 2.11. We highlight below the key additional points that must be taken into account. The algorithm in the proof of Theorem B.2 is obtained by using the ellipsoid algorithm to maximize the concave function

$$\max_\theta g^\mu(\theta)$$

over the interior of $P(\mathcal{M})$. Here, $g^\mu(\theta) = \min_\lambda f^\mu_\theta(\lambda)$. Indeed, the maximum is attained at $\theta^*$ where

$$\theta^*_e = \frac{\sum_{M : e \in M} e^{-\mu(M)}}{Z^\mu}$$

and the objective value at this maximum is $\ln Z^\mu$. Thus, we use the ellipsoid algorithm to search for $\theta^*$. The issue of interiority as in Lemma 7.2 is resolved by proving the following lemma.

Lemma B.3 Given an $\varepsilon > 0$, there exist an $\eta > 0$ and $\theta'$ such that $\theta'$ is in the $\eta$-interior of $P(\mathcal{M})$ and

$$g^\mu(\theta') \ge \left(1 - \frac{\varepsilon}{16 m \|\mu\|_1}\right) \ln Z^\mu \ge \ln Z^\mu - \frac{\varepsilon}{16}.$$

Moreover, $\eta$ is at least a polynomial in $1/m$, $\varepsilon$ and $1/\|\mu\|_1$.

Similarly, Lemma 7.5 can be generalized to show the following:

Lemma B.4 Let $\theta, \theta' \in P(\mathcal{M})$ be such that $\|\theta - \theta'\| \le \varepsilon$ and $\theta$ is in the $\eta$-interior of $P(\mathcal{M})$. Then

$$g^\mu(\theta') \ge \left(1 - \frac{\varepsilon}{\eta}\right) g^\mu(\theta).$$

Using the above two lemmas it is straightforward to generalize the argument in Theorem 2.11 to prove Theorem B.2.

To complete the picture we show that, given access to a generalized counting oracle, we can solve the 
above pair of convex programs. This gives the following theorem which is a generalization of Theorem [ 



Theorem B.5 There exists an algorithm that, given a maximal set of linearly independent equalities $(A^{=}, b)$ and a generalized approximate counting oracle for $P(\mathcal{M}) \subseteq \mathbb{R}^m$, a $\theta$ in the $\eta$-interior of $P(\mathcal{M})$, $\mu \in \mathbb{R}^m$ and an $\varepsilon > 0$, returns a $\lambda^\circ$ such that

$$f^\mu_\theta(\lambda^\circ) \le f^\mu_\theta(\lambda^*) + \varepsilon,$$

where $\lambda^*$ is the optimal solution to the dual of the max-entropy convex program for $(\mathcal{M}, \theta, \mu)$ from Figure 7. Assuming that the generalized approximate counting oracle is polynomial in its input parameters, the running time of the algorithm is polynomial in $m$, $1/\eta$, $\log 1/\varepsilon$, the number of bits needed to represent $\theta$ and $(A^{=}, b)$, and $\|\mu\|_1$.

A similar theorem can be stated in the exact counting oracle setting, extending Theorem 2.6. The proof of Theorem B.5 is quite straightforward and relies on the following lemma, which states that the objective of the primal convex program is just an additive shift of the objective of the maximum entropy convex program. Thus, the primal optimum solution remains the same. This implies that the dual convex program can be solved as in the proof of Theorem 2.8 with one additional call to the generalized approximate counting oracle involving $\mu$.

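Lemma B.6 Let $p$ be any feasible solution to the primal convex program given in Figure 6. Then the objective

$$\sum_{M \in \mathcal{M}} p_M \ln \frac{p^\mu_M}{p_M} + \ln Z^\mu = \sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M} - \langle \theta, \mu \rangle.$$

Proof: We have

$$\begin{aligned} \sum_{M \in \mathcal{M}} p_M \ln \frac{p^\mu_M}{p_M} + \ln Z^\mu &= \sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M} + \sum_{M \in \mathcal{M}} p_M \ln p^\mu_M + \ln Z^\mu \\ &= \sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M} + \sum_{M \in \mathcal{M}} p_M \ln \frac{e^{-\mu(M)}}{Z^\mu} + \ln Z^\mu \\ &= \sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M} + \sum_{M \in \mathcal{M}} p_M (-\mu(M)) - \sum_{M \in \mathcal{M}} p_M \ln Z^\mu + \ln Z^\mu \\ &= \sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M} - \sum_{e \in [m]} \mu_e \sum_{M : e \in M} p_M \\ &= \sum_{M \in \mathcal{M}} p_M \ln \frac{1}{p_M} - \sum_{e \in [m]} \mu_e \theta_e \end{aligned}$$

where we have used the facts that $\sum_M p_M = 1$ and $\sum_{M : e \in M} p_M = \theta_e$, since $p$ satisfies the constraints of the convex program in Figure 6. ■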
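The additive-shift identity of Lemma B.6 holds for every distribution $p$ once $\theta$ is taken to be its marginal vector, which makes it simple to verify numerically. The family, the weights and the distribution below are arbitrary illustrative choices.

```python
import math
from itertools import combinations

m = 3
family = [frozenset(c) for c in combinations(range(m), 2)]  # hypothetical toy M
mu = [0.5, -0.2, 1.0]   # arbitrary weights
p = [0.5, 0.4, 0.1]     # any distribution over `family`

def weight(M):
    """mu(M) = sum of mu_e over e in M."""
    return sum(mu[e] for e in M)

z_mu = sum(math.exp(-weight(M)) for M in family)
p_mu = [math.exp(-weight(M)) / z_mu for M in family]
theta = [sum(pM for pM, M in zip(p, family) if e in M) for e in range(m)]

# Left side: objective of the min-KL program; right side: entropy minus <theta, mu>.
lhs = sum(pM * math.log(qM / pM) for pM, qM in zip(p, p_mu)) + math.log(z_mu)
rhs = -sum(pM * math.log(pM) for pM in p) - sum(t * u for t, u in zip(theta, mu))
```

Both sides agree up to rounding, reflecting that the two $\ln Z^\mu$ terms cancel and that $\sum_M p_M \mu(M) = \langle \theta, \mu \rangle$.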

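C $2 \to 2$-norm of $\nabla f$

For a fixed $\theta$, let

$$f(\lambda) \stackrel{\text{def}}{=} \langle \lambda, \theta \rangle + \ln \sum_{M \in \mathcal{M}} e^{-\lambda(M)}. \qquad (16)$$

Recall that $\|\nabla f\|_{2 \to 2}$ is the $2 \to 2$ Lipschitz constant of $\nabla f$ and is defined to be the smallest non-negative number such that

$$\|\nabla f(\lambda) - \nabla f(\lambda')\|_2 \le \|\nabla f\|_{2 \to 2} \cdot \|\lambda - \lambda'\|_2$$

for all $\lambda, \lambda'$. In this section, we show the following theorem, which can be used to give alternative gradient-descent based proofs of Theorems 2.6 and 2.8.

Theorem C.1 Let $f$ be defined as in (16). Then, $\|\nabla f\|_{2 \to 2} \le O(m\sqrt{m})$.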

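Proof: Given $\lambda_1$ and $\lambda_2$, let $p^{\lambda_1}$ and $p^{\lambda_2}$ be the corresponding product distributions and let $\theta_1$ and $\theta_2$ be the corresponding marginals. We break the calculation of $\|\nabla f\|_{2 \to 2}$ into two parts:

$$\|\lambda_1 - \lambda_2\|_2 \ge \frac{1}{10\sqrt{m}} \quad \text{and} \quad \|\lambda_1 - \lambda_2\|_2 < \frac{1}{10\sqrt{m}}.$$

Estimating $\|\nabla f\|_{2 \to 2}$ in the first case is straightforward. To see this, recall that

$$\nabla f(\lambda_i) = \theta - \theta_i \qquad (17)$$

for $i = 1, 2$. Thus,

$$\|\nabla f(\lambda_i)\|_2 \le \sqrt{m} \qquad (18)$$

since $\theta, \theta_i \in P(\mathcal{M}) \subseteq [0,1]^m$, which implies that $\theta - \theta_i \in [-1, 1]^m$ for $i = 1, 2$. Hence,

$$\|\nabla f(\lambda_1) - \nabla f(\lambda_2)\|_2 \le 2\sqrt{m}.$$

Thus, when

$$\|\lambda_1 - \lambda_2\|_2 \ge \frac{1}{10\sqrt{m}}, \qquad (19)$$

we obtain, using the triangle inequality, (18) and (19) in that order,

$$\|\nabla f(\lambda_1) - \nabla f(\lambda_2)\|_2 \le \|\nabla f(\lambda_1)\|_2 + \|\nabla f(\lambda_2)\|_2 \le 2\sqrt{m} = 20m \cdot \frac{1}{10\sqrt{m}} \le 20m \cdot \|\lambda_1 - \lambda_2\|_2.$$

Hence, we move on to proving the theorem in the case

$$\|\lambda_1 - \lambda_2\|_2 < \frac{1}{10\sqrt{m}}. \qquad (20)$$

Towards this, define

$$\varepsilon = \sqrt{m}\,\|\lambda_1 - \lambda_2\|_2. \qquad (21)$$

Note that by assumption (20), $\varepsilon < 1/10$. It follows from (21) that for any $M \in \mathcal{M}$, using the triangle inequality and then the Cauchy-Schwarz inequality,

$$|\lambda_1(M) - \lambda_2(M)| = \Big| \sum_{e \in M} \big((\lambda_1)_e - (\lambda_2)_e\big) \Big| \le \sum_{e \in M} |(\lambda_1)_e - (\lambda_2)_e| \le \|\lambda_1 - \lambda_2\|_1 \le \sqrt{m}\,\|\lambda_1 - \lambda_2\|_2 = \varepsilon. \qquad (22)$$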



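The following series of claims establishes Theorem C.1.

Claim C.2 $e^{-\varepsilon} \le \frac{Z^{\lambda_1}}{Z^{\lambda_2}} \le e^{\varepsilon}$.

Proof: For each $M \in \mathcal{M}$, (22) implies that

$$e^{-\varepsilon} \le \frac{e^{-\lambda_1(M)}}{e^{-\lambda_2(M)}} \le e^{\varepsilon}. \qquad (23)$$

Thus,

$$e^{-\varepsilon} \le \min_{M \in \mathcal{M}} \frac{e^{-\lambda_1(M)}}{e^{-\lambda_2(M)}} \le \frac{\sum_{M \in \mathcal{M}} e^{-\lambda_1(M)}}{\sum_{M \in \mathcal{M}} e^{-\lambda_2(M)}} \le \max_{M \in \mathcal{M}} \frac{e^{-\lambda_1(M)}}{e^{-\lambda_2(M)}} \le e^{\varepsilon}. \qquad (24)$$

Here, we have used the inequality that for non-negative numbers $a_1, a_2, \ldots$ and $b_1, b_2, \ldots$,

$$\min_i \frac{a_i}{b_i} \le \frac{\sum_i a_i}{\sum_i b_i} \le \max_i \frac{a_i}{b_i}. \qquad (25)$$

The claim follows by combining (24) with the definition of $Z^{\lambda_1}$ and $Z^{\lambda_2}$. ■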

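Claim C.3 For each $M \in \mathcal{M}$, $e^{-2\varepsilon} \le \frac{p^{\lambda_1}_M}{p^{\lambda_2}_M} \le e^{2\varepsilon}$.

Proof: By definition,

$$\frac{p^{\lambda_1}_M}{p^{\lambda_2}_M} = \frac{e^{-\lambda_1(M)}}{e^{-\lambda_2(M)}} \cdot \frac{Z^{\lambda_2}}{Z^{\lambda_1}}.$$

Since all the numbers involved in this product are positive, (23) and Claim C.2 imply that both the ratios on the right hand side of the equation are bounded from below by $e^{-\varepsilon}$ and from above by $e^{\varepsilon}$. Hence, their product is bounded from below by $e^{-2\varepsilon}$ and from above by $e^{2\varepsilon}$, completing the proof of the claim. ■

Claim C.4 For each $e \in [m]$, $e^{-2\varepsilon} \le \frac{(\theta_1)_e}{(\theta_2)_e} \le e^{2\varepsilon}$.

Proof: By the definition of $\theta_1$ and $\theta_2$,

$$\frac{(\theta_1)_e}{(\theta_2)_e} = \frac{\sum_{M \in \mathcal{M},\, M \ni e} p^{\lambda_1}_M}{\sum_{M \in \mathcal{M},\, M \ni e} p^{\lambda_2}_M}.$$

Combining Claim C.3 and (25), we obtain that

$$e^{-2\varepsilon} \le \min_{M \in \mathcal{M},\, M \ni e} \frac{p^{\lambda_1}_M}{p^{\lambda_2}_M} \le \frac{\sum_{M \in \mathcal{M},\, M \ni e} p^{\lambda_1}_M}{\sum_{M \in \mathcal{M},\, M \ni e} p^{\lambda_2}_M} \le \max_{M \in \mathcal{M},\, M \ni e} \frac{p^{\lambda_1}_M}{p^{\lambda_2}_M} \le e^{2\varepsilon},$$

completing the proof of the claim.

Claim C.5 For $\varepsilon < 1/10$, $\|\theta_1 - \theta_2\|_1 \le 3\varepsilon m$.

Proof: By definition,

$$\|\theta_1 - \theta_2\|_1 = \sum_{e \in [m]} |(\theta_1)_e - (\theta_2)_e|.$$

Since $\theta_1, \theta_2 \ge 0$, Claim C.4 implies that for each $e \in [m]$,

$$(e^{-2\varepsilon} - 1)(\theta_2)_e \le (\theta_1)_e - (\theta_2)_e \le (e^{2\varepsilon} - 1)(\theta_2)_e.$$

Since $\max\{|e^{-2\varepsilon} - 1|, |e^{2\varepsilon} - 1|\} \le 3\varepsilon$ for $\varepsilon < 1/10$, the above inequality reduces to, for each $e \in [m]$,

$$|(\theta_1)_e - (\theta_2)_e| \le 3\varepsilon (\theta_2)_e.$$

Thus,

$$\|\theta_1 - \theta_2\|_1 = \sum_{e \in [m]} |(\theta_1)_e - (\theta_2)_e| \le \sum_{e \in [m]} 3\varepsilon (\theta_2)_e \le 3\varepsilon m$$

since $\theta_2 \ge 0$ and $\theta_2 \in P(\mathcal{M}) \subseteq [0,1]^m$. This completes the proof of the claim.

Finally, to complete the proof of Theorem C.1 note that when (20) holds, using (17), Claim C.5 and (21) in that order,

$$\|\nabla f(\lambda_1) - \nabla f(\lambda_2)\|_2 = \|\theta_1 - \theta_2\|_2 \le \|\theta_1 - \theta_2\|_1 \le 3m\varepsilon = 3m\sqrt{m}\,\|\lambda_1 - \lambda_2\|_2.$$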